gewu-lab / music-avqa

MUSIC-AVQA, CVPR2022 (ORAL)

License: MIT License

Python 100.00%

music-avqa's Issues

Question padding

In the following section of the code in dataloader_qa_grd_baseline.py,

if len(question) < self.max_len:
    n = self.max_len - len(question)
    for i in range(n):
        question.append('<pad>')

are you padding the question with extra tokens? If so, did you never find a question longer than 14 tokens? More to the point, what happens when a question is longer than 14 tokens? Also, is this length related to the 14*14 visual features?
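For reference, a minimal sketch of handling both cases, padding short questions and clipping long ones. The helper name `pad_or_truncate` is hypothetical, and truncation is an assumption about one reasonable behavior, not something the repository is confirmed to do; `max_len=14` is the value mentioned in this issue.

```python
def pad_or_truncate(tokens, max_len=14, pad_token="<pad>"):
    """Return a copy of `tokens` with exactly `max_len` entries."""
    # Clip first (assumed behavior for over-long questions), then pad.
    clipped = list(tokens[:max_len])
    clipped.extend([pad_token] * (max_len - len(clipped)))
    return clipped


short = pad_or_truncate("is the violin louder".split())
long = pad_or_truncate([f"w{i}" for i in range(20)])
print(len(short), short[-1])  # 14 <pad>
print(len(long), long[-1])    # 14 w13
```

With only the padding branch, as in the snippet quoted above, a question longer than `max_len` would pass through at its original length.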

"Cannot load" some videos when extracting audio from video

Thanks for the awesome dataset as well as open sourcing everything!

I found that for many videos, when I extract the audio from the video using your script I get: cannot load MUSIC-AVQA-videos-Real/00001835.mp4.

Is this something that also happened on your end? Also, could you please release the audio files separately?
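One way to find out exactly which files fail is to attempt the extraction file by file and collect the failures. The sketch below is an assumption-laden alternative to the repository's script: it calls the `ffmpeg` command-line tool (which must be installed and on `PATH`), and the function names, the `.wav` output naming, and the 16 kHz mono settings are illustrative choices rather than the authors' configuration.

```python
import subprocess
from pathlib import Path


def ffmpeg_audio_cmd(video_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that writes the audio track as a mono WAV."""
    return [
        "ffmpeg", "-y",            # overwrite an existing output file
        "-i", str(video_path),     # input video
        "-vn",                     # drop the video stream
        "-ac", "1",                # one audio channel (assumed setting)
        "-ar", str(sample_rate),   # output sample rate (assumed setting)
        str(wav_path),
    ]


def extract_all(video_dir, audio_dir, run=subprocess.run):
    """Try every .mp4 in `video_dir`; return the paths that failed."""
    audio_dir = Path(audio_dir)
    audio_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for video in sorted(Path(video_dir).glob("*.mp4")):
        wav = audio_dir / (video.stem + ".wav")
        result = run(ffmpeg_audio_cmd(video, wav), capture_output=True)
        if result.returncode != 0:
            failed.append(video)
    return failed
```

Calling `extract_all("MUSIC-AVQA-videos-Real", "audio")` would return the list of videos for which extraction failed, which makes it easier to report the affected files.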

Thanks again!

What's the function of the "pure" variable?

Thanks for your great work! I'm reading the code and have some questions about it.
In net_avst.py, line 125, you assign audio_feat to audio_feat_pure, but audio_feat is not changed before line 207. It seems that the "pure" variable has no actual effect. Is it there only to indicate that the audio_feat used in the Temporal Grounding Module is "pure"?

In short, does the rename affect gradient flow? I noticed that the shape of audio_feat_pure is [B, T, C] and the shape of audio_feat is [B*T, C], but both point to the same data (line 206 of net_avst.py). Maybe I can use audio_feat in line 206 directly.
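The renaming part of the question can be checked in plain Python, because it is a property of name binding rather than of PyTorch: `b = a` binds a second name to the same object and performs no operation, so a rename alone adds nothing to an autograd graph. The sketch below uses a hypothetical stand-in class `Features` instead of a real tensor; it does not reproduce net_avst.py.

```python
class Features:
    """Hypothetical stand-in for a tensor; only object identity matters here."""

    def __init__(self, values):
        self.values = values


audio_feat = Features([0.1, 0.2, 0.3])
audio_feat_pure = audio_feat            # binds a second name, makes no copy
same_before = audio_feat_pure is audio_feat

# Rebinding `audio_feat` to a new object leaves the old object untouched,
# and `audio_feat_pure` still refers to that old object.
audio_feat = Features([v * 2 for v in audio_feat.values])
same_after = audio_feat_pure is audio_feat

print(same_before, same_after)   # True False
print(audio_feat_pure.values)    # [0.1, 0.2, 0.3]
```

So the extra name only makes a difference if `audio_feat` is rebound to a new object somewhere between the two lines in question.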

The size of extracted 14*14 features is not "[320, 512, 14, 14]"

Hello, thank you for your great work. I ran into a problem when running the code.
I used your "extract_14x14_feat.py" to extract the 14x14 visual features, but the extracted "0000xxxx.npy" has shape [4, 512, 14, 14]. As a result, running "main_avst.py" fails with "IndexError: index 5 is out of bounds for dimension 0 with size 4".
I found that 'selected_image' in 'extract_14x14_feat.py' has shape [4, 3, 224, 224]. How can I solve this problem and run the code successfully?
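A quick diagnostic for a mismatch like this is to count the extracted frames per video before running feature extraction. This is only a sketch under stated assumptions: the layout (one sub-directory of `.jpg` files per video), the function names, and the default of 60 expected frames are all hypothetical, so pass whatever your own setup uses.

```python
from pathlib import Path


def frame_counts(frames_root, pattern="*.jpg"):
    """Map each video sub-directory name to its number of frame files."""
    return {
        d.name: sum(1 for _ in d.glob(pattern))
        for d in sorted(Path(frames_root).iterdir())
        if d.is_dir()
    }


def unexpected(counts, expected=60):
    """Keep only the videos whose frame count differs from `expected`."""
    return {name: n for name, n in counts.items() if n != expected}
```

For example, `unexpected(frame_counts("frames"))` lists the videos whose frame folders deviate from the expected count, which helps tell a data problem apart from a problem in the extraction script.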

Misalignment between audio and video frames

Thank you so much for your fantastic work!
I found a few videos shorter than 60 s in your dataset. When I use your frame extraction script to extract frames from such a video at 1 fps, I get fewer than 60 frames, yet the corresponding audio feature in the vggish folder has shape [60, 128].
I would be grateful if you could let me know how to align the audio and the frames from the same video.
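One simple option, shown here as a sketch and not as the authors' method, is to cut both sequences to their shared length. It assumes that the i-th audio row and the i-th frame both describe the i-th second of the video and that any extra audio rows lie past the end of the video; neither assumption is confirmed by the repository. The 57-frame video is a made-up example.

```python
def align_by_truncation(frames, audio_feats):
    """Cut both per-second sequences to their shared length."""
    n = min(len(frames), len(audio_feats))
    return frames[:n], audio_feats[:n]


frames = [f"frame_{i:02d}" for i in range(57)]   # hypothetical 57 s video at 1 fps
audio = [[0.0] * 128 for _ in range(60)]          # audio features of shape [60, 128]
frames, audio = align_by_truncation(frames, audio)
print(len(frames), len(audio))  # 57 57
```

Note that samples of different lengths still cannot be stacked into one batch, so a fixed length or a custom collate function would be needed afterwards.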

A question about reproducing the results

Hello, thank you for your excellent work. What causes the following error during training? Thanks!
Traceback (most recent call last):
  File "net_grd_avst/main_avst.py", line 275, in <module>
    main()
  File "net_grd_avst/main_avst.py", line 258, in main
    train(args, model, train_loader, optimizer, criterion, epoch=epoch)
  File "net_grd_avst/main_avst.py", line 46, in train
    for batch_idx, sample in enumerate(train_loader):
  File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 64, in default_collate
    return default_collate([torch.as_tensor(b) for b in batch])
  File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [11, 512, 14, 14] at entry 0 and [10, 512, 14, 14] at entry 1

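This problem has been solved! Previously I used the extracted frames (1fps) directly, but some of those videos have 62 frames and others have 60. After re-extracting the frames from the original videos, every video has 60 frames.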
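The two shapes in the traceback would be consistent with taking every k-th frame from videos of 62 and 60 frames. The arithmetic is sketched below; the stride of 6 is an assumption chosen because it reproduces the reported sizes, not a value confirmed from the repository, and `sampled_indices` is a hypothetical helper.

```python
def sampled_indices(num_frames, stride=6, max_frames=None):
    """Indices kept when taking every `stride`-th frame."""
    if max_frames is not None:
        # Optionally ignore frames beyond a fixed cap before subsampling.
        num_frames = min(num_frames, max_frames)
    return list(range(0, num_frames, stride))


print(len(sampled_indices(62)))                  # 11
print(len(sampled_indices(60)))                  # 10
print(len(sampled_indices(62, max_frames=60)))   # 10
```

Under that assumption, capping every video at 60 frames before subsampling gives equal-sized tensors, which has the same effect as re-extracting the frames so that each video has exactly 60.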

Have you tried to use a bigger text encoder?

I tried using a pretrained BERT to encode the questions, but no matter how I tune the hyperparameters, the accuracy stays around 50%. Does this mean that these questions should not be encoded with a complex encoder? Could you please share your opinion on this phenomenon?
