music-avqa's People

Contributors

ayameyao, dtaoo

music-avqa's Issues

Question padding

In the following section of the code in dataloader_qa_grd_baseline.py,

if len(question) < self.max_len:
    n = self.max_len - len(question)
    for i in range(n):
        question.append('<pad>')

are you padding the question with extra tokens? If so, did you never find a question longer than 14 tokens? More to the point, what happens if a question is longer than 14 tokens? Also, is this length related to the 14*14 visual features?
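For reference, here is a minimal sketch of how a fixed-length question could be produced for both short and long inputs. The truncation step is an assumption added for illustration; the quoted snippet only pads, and nothing in this thread says how the repository treats longer questions. The helper name `pad_or_truncate` is hypothetical, and the default of 14 is taken from the question above.

```python
from typing import List

PAD_TOKEN = "<pad>"  # same pad token as in the quoted snippet


def pad_or_truncate(tokens: List[str], max_len: int = 14) -> List[str]:
    """Return a copy of `tokens` with exactly `max_len` entries.

    Hypothetical helper: truncation is an assumption, not the repository's
    documented behaviour.
    """
    tokens = list(tokens)[:max_len]                   # drop tokens past max_len
    tokens += [PAD_TOKEN] * (max_len - len(tokens))   # pad short questions
    return tokens


short_q = pad_or_truncate("How many instruments are sounding".split())
long_q = pad_or_truncate(["tok"] * 20)
```

With this sketch a 5-token question gains 9 pad tokens, and a 20-token question is cut to its first 14 tokens.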

"Cannot load" some videos when extracting audio from video

Thanks for the awesome dataset as well as open sourcing everything!

I found that for many videos, when I extract the audio from the video using your script I get: cannot load MUSIC-AVQA-videos-Real/00001835.mp4.

Is this something that also happened on your end? Also, could you please release the audio files separately?

Thanks again!
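While waiting for an answer, one way to find out exactly which files fail is to wrap the extraction call and collect the failures. The sketch below assumes the extractor raises an exception on an unreadable file; `extract_fn` is a hypothetical stand-in for whatever function does the actual extraction, not the repository's script.

```python
from typing import Callable, Iterable, List, Tuple


def extract_all(
    video_paths: Iterable[str],
    extract_fn: Callable[[str], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run `extract_fn` on every path and collect failures instead of aborting.

    `extract_fn` is a placeholder (assumption) for the real audio extractor.
    """
    ok: List[str] = []
    failed: List[Tuple[str, str]] = []
    for path in video_paths:
        try:
            extract_fn(path)
        except Exception as exc:  # keep going and record the reason
            failed.append((path, repr(exc)))
        else:
            ok.append(path)
    return ok, failed


def fake_extract(path: str) -> None:
    """Stand-in extractor that fails on one file, for demonstration only."""
    if path.endswith("00001835.mp4"):
        raise IOError("cannot load " + path)


ok, failed = extract_all(
    ["MUSIC-AVQA-videos-Real/00000001.mp4", "MUSIC-AVQA-videos-Real/00001835.mp4"],
    fake_extract,
)
```

The `failed` list can then be attached to the issue so the unreadable videos can be checked individually.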

What's the function of the "pure" variable?

Thanks for your great work! I'm reading the code and have some questions about it.
In net_avst.py, line 125, you assign audio_feat to audio_feat_pure, but audio_feat is not changed before line 207. It seems that the "pure" variable has no actual effect. Is it just there to indicate that the audio_feat used in the Temporal Grounding Module is "pure"?

In short, does the rename influence the gradient flow? I noticed that the shape of audio_feat_pure is [B, T, C] and the shape of audio_feat is [B*T, C], but both point to the same underlying data (at line 206 of net_avst.py). Maybe I can use audio_feat at line 206 directly.

The size of extracted 14*14 features is not "[320, 512, 14, 14]"

Hello, thank you for your great work. I ran into a problem when running the code.
I used your "extract_14x14_feat.py" to extract the 14x14 visual features, but the shape of each extracted "0000xxxx.npy" is [4, 512, 14, 14]. As a result, running "main_avst.py" fails with "IndexError: index 5 is out of bounds for dimension 0 with size 4".
I found that the shape of 'selected_image' in 'extract_14x14_feat.py' is [4, 3, 224, 224]. How can I solve this problem and run the code successfully?

Misalignment between audio and video frames

Thank you so much for your fantastic work!
I found a few videos shorter than 60s in your dataset. When I use your frame extraction script to extract frames at 1fps from such a video, I get fewer than 60 frames, yet the corresponding audio feature in the vggish folder has shape [60, 128].
I would be grateful if you could let me know how to align the audio and the frames from the same video.
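Until the authors clarify how the [60, 128] audio features were produced for short videos, here are two generic ways to bring per-second frames and audio steps to the same length: truncate both to the shorter stream, or pad the shorter one by repeating its last step. This is only a sketch under those assumptions, not the dataset's actual procedure; `align_per_second` is a hypothetical helper and works on plain Python sequences.

```python
from typing import Any, List, Sequence, Tuple


def align_per_second(
    frames: Sequence[Any],
    audio: Sequence[Any],
    mode: str = "truncate",
) -> Tuple[List[Any], List[Any]]:
    """Make two per-second sequences the same length.

    mode="truncate": cut both to the shorter length.
    mode="pad": repeat the last step of the shorter one (an assumption,
    not necessarily how the released features were built).
    """
    frames, audio = list(frames), list(audio)
    if not frames or not audio:
        raise ValueError("both streams need at least one step")
    if mode == "truncate":
        n = min(len(frames), len(audio))
        return frames[:n], audio[:n]
    if mode == "pad":
        n = max(len(frames), len(audio))
        frames += [frames[-1]] * (n - len(frames))
        audio += [audio[-1]] * (n - len(audio))
        return frames, audio
    raise ValueError(f"unknown mode: {mode!r}")


# Example: a 57-second video with a 60-step audio feature.
frames = [f"frame_{i}" for i in range(57)]
audio = [f"audio_{i}" for i in range(60)]
f_trunc, a_trunc = align_per_second(frames, audio, mode="truncate")
f_pad, a_pad = align_per_second(frames, audio, mode="pad")
```

Truncation discards the trailing audio steps, while padding keeps them but pairs them with a repeated final frame; which one is appropriate depends on how the audio features were padded in the first place.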

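A question about reproducing the results

Hello, thank you for your excellent work. What could be causing the following error during training? Thanks!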
Traceback (most recent call last):
File "net_grd_avst/main_avst.py", line 275, in <module>
main()
File "net_grd_avst/main_avst.py", line 258, in main
train(args, model, train_loader, optimizer, criterion, epoch=epoch)
File "net_grd_avst/main_avst.py", line 46, in train
for batch_idx, sample in enumerate(train_loader):
File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
data = self._next_data()
File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
return self._process_data(data)
File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
data.reraise()
File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
data = fetcher.fetch(index)
File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 64, in default_collate
return default_collate([torch.as_tensor(b) for b in batch])
File "/uestcers/uestc1/.conda/envs/music/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [11, 512, 14, 14] at entry 0 and [10, 512, 14, 14] at entry 1

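This issue is resolved! Previously I used the provided extracted frames (1fps) directly, but some videos there have 62 frames and others have 60. After re-extracting the frames from the original videos, every video has 60 frames.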
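For anyone hitting the same "stack expects each tensor to be equal size" error, a quick way to spot the offending videos before training is to count the extracted frames per video. The sketch below assumes one sub-directory per video containing `.jpg` frames; that layout, the file pattern, and the helper name `find_bad_frame_counts` are assumptions for illustration, so adjust them to your own extraction output.

```python
from pathlib import Path
from typing import Dict


def find_bad_frame_counts(
    frames_root: str,
    expected: int = 60,
    pattern: str = "*.jpg",
) -> Dict[str, int]:
    """Map video-directory name -> frame count for every directory whose
    count differs from `expected`.

    Assumes (hypothetically) one sub-directory of frames per video.
    """
    bad: Dict[str, int] = {}
    for video_dir in sorted(Path(frames_root).iterdir()):
        if not video_dir.is_dir():
            continue
        n_frames = sum(1 for _ in video_dir.glob(pattern))
        if n_frames != expected:
            bad[video_dir.name] = n_frames
    return bad
```

Calling `find_bad_frame_counts("path/to/frames")` returns an empty dict when every video has exactly 60 frames; any entries it does return are the videos worth re-extracting.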

Have you tried to use a bigger text encoder?

I tried using a pretrained BERT to encode the text, but no matter how I tune the hyperparameters, the accuracy stays around 50%. Does this mean these questions should not be encoded with a complex encoder? Could you please share your opinion on this behaviour?
