
UAVM: Towards Unifying Audio and Visual Models

News

This paper will be presented as an oral paper in the ICASSP Audio for Multimedia and Multimodal Processing session on June 6, 2023, at 10:50 (Eastern European Summer Time).

Introduction

Illustration of the UAVM model.

This repository contains the official implementation (in PyTorch) of the models and experiments proposed in the IEEE Signal Processing Letters 2022 paper UAVM: Towards Unifying Audio and Visual Models (Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass; MIT CSAIL). [IEEE Xplore] [arxiv]

Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. We also find several intriguing properties of UAVM that its modality-independent counterparts do not have.
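To make the design concrete, here is a minimal PyTorch sketch of the unified-backbone idea (illustrative only, not the official UnifiedTrans implementation in src/models/mm_trans.py): shallow modality-specific projections map pre-extracted audio or visual feature sequences into a single Transformer encoder whose weights are shared by both modalities. The layer sizes and the 309-class VGGSound head below are assumptions.

import torch
import torch.nn as nn

class TinyUnifiedAVModel(nn.Module):
    # Illustrative sketch of a unified audio-visual backbone (not the official code).
    def __init__(self, audio_dim=768, video_dim=768, embed_dim=768,
                 num_layers=3, num_classes=309):
        super().__init__()
        # Shallow modality-specific input projections.
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)
        # A single Transformer encoder shared by both modalities.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x, modality):
        # x: (batch, seq_len, feat_dim) sequence of pre-extracted features.
        x = self.audio_proj(x) if modality == "audio" else self.video_proj(x)
        x = self.shared_encoder(x)             # same weights for audio and video
        return self.classifier(x.mean(dim=1))  # mean-pool over time, then classify

The only modality-specific parameters in this sketch are the two input projections; everything after them is shared.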

Performance-wise, our model and training pipeline achieve state-of-the-art results on VGGSound.


To help readers better understand the pros and cons of this work, we have attached the anonymous reviews and our responses here. We thank the associate editor and the anonymous reviewers for their invaluable comments.

Citing

Please cite our paper if you find this repository useful.

@ARTICLE{uavm_gong,
  author={Gong, Yuan and Liu, Alexander H. and Rouditchenko, Andrew and Glass, James},
  journal={IEEE Signal Processing Letters}, 
  title={UAVM: Towards Unifying Audio and Visual Models}, 
  year={2022},
  volume={29},
  pages={2437-2441},
  doi={10.1109/LSP.2022.3224688}}

Getting Started

Prepare the Environment

Clone or download this repository and set it as the working directory, then create a virtual environment and install the dependencies:

cd uavm/ 
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt 

Where is the code?

  1. The UAVM model, modality-independent model, and cross-modality attention model scripts are in src/models/mm_trans.py (UnifiedTrans, SeperateTrans, and FullAttTrans).
  2. The training and test scripts are in src/run.py and src/traintest.py.
  3. The ConvNext audio feature extractor model script is in src/models/convnext.py.
  4. The ConvNext audio feature extractor model training scripts are in src/run_feat.py and src/traintest_feat.py.
  5. Audio and visual feature extraction scripts are in src/gen_audio_feat/ and src/gen_video_feat/, respectively.
  6. The VGGSound recipe is in egs/vggsound.
  7. The AudioSet recipe is in egs/audioset.
  8. The probing test and plot scripts for Figures 3, 4, and 5 of the UAVM paper are in src/embedding, src/retrieval, and src/attention_diff, respectively.

VGGSound Recipe

This part of the code is in egs/vggsound/.

Option 1 (Recommended). One-Click Recipe

We recommend starting with one_click_recipe.sh, which loads our pretrained features, does all preprocessing, and trains the UAVM model. It should report state-of-the-art accuracy on VGGSound.

Option 2. Step-by-Step Recipe to Do Everything from Scratch

Alternatively, you can start from scratch; we provide everything you need to do so. Note: if you don't want to download the dataset yourself and train your own feature extractor, you can skip steps 1-4 and go directly to step 5.

  1. Download the VGGSound dataset and create json datafiles that include the wav paths (see samples at [here], and the hedged datafile sketch after this list); then generate the sample weight file using ./datafile/gen_weight_file.
  2. Train an audio feature extractor on the VGGSound audio data using run_feat_extractor.sh, which calls src/run_feat.py and src/traintest_feat.py. (You can skip this step by using our pretrained feature extractor from [this link].)
  3. Generate audio features and save them to disk using src/gen_audio_feat/gen_audio_feature.py. (You can skip steps 1 and 2 by using our pretrained audio features from [this link].)
  4. Generate visual features and save them to disk using the scripts in src/gen_video_feat. (You can skip this step by using our pretrained visual features from [this link].)
  5. Create json datafiles that include the audio/visual feature paths and label information (see samples at [here]), and generate the sample weight file using ./datafile/gen_weight_file.
  6. Train the UAVM, cross-modality attention, and modality-independent models using run_uavm.sh, run_fullatt_trans.sh, and run_ind_trans.sh, respectively.
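The exact datafile format is defined by the linked samples; the snippet below is only a hedged guess at what a wav-path datafile might look like, assuming the same {"data": [{"wav": ..., "labels": ...}]} layout used in the author's AST repository. The paths and label strings are made up.

import json

# Hedged guess at the wav-path datafile layout; check the linked samples for
# the exact fields UAVM expects. Paths and label strings below are made up.
datafile = {
    "data": [
        {"wav": "/your/path/vggsound/audio/clip_000001.wav", "labels": "playing piano"},
        {"wav": "/your/path/vggsound/audio/clip_000002.wav", "labels": "dog barking"},
    ]
}

with open("vggsound_train.json", "w") as f:
    json.dump(datafile, f, indent=1)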

AudioSet Recipe

This part of the code is in egs/audioset/.

The AudioSet recipe is almost identical to the VGGSound recipe, except that we cannot provide pretrained features.

  1. Download the AudioSet dataset and create json datafiles that include the wav paths (see samples at [here]); then generate the sample weight file using ./datafile/gen_weight_file.
  2. Train an audio feature extractor on the AudioSet audio data using run_feat_extractor.sh, which calls src/run_feat.py and src/traintest_feat.py. (You can skip steps 1 and 2 by using our pretrained audio feature extractor [here].)
  3. Generate audio features and save them to disk using src/gen_audio_feat/gen_audio_feature.py. (For the full AudioSet-2M, this process needs to be parallelized to finish in a reasonable time; a sharding sketch follows this list.)
  4. Generate visual features and save them to disk using the scripts in src/gen_video_feat. (For the full AudioSet-2M, this also needs to be parallelized.)
  5. Create json datafiles that include the audio/visual feature paths and label information (see samples at [here]), and generate the sample weight file using ./datafile/gen_weight_file.
  6. Train the UAVM, cross-modality attention, and modality-independent models using run_uavm.sh, run_fullatt_trans.sh, and run_ind_trans.sh, respectively.
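As mentioned in steps 3 and 4, a simple way to parallelize feature extraction over AudioSet-2M is to shard the file list and launch one process per shard. The sketch below is illustrative only: extract_and_save is a hypothetical stand-in for the per-file logic in src/gen_audio_feat/gen_audio_feature.py, and the file-list name is made up.

import sys

def extract_and_save(wav_path):
    # Hypothetical stand-in for the per-file extraction logic in
    # src/gen_audio_feat/gen_audio_feature.py; replace with the real code.
    raise NotImplementedError

def run_shard(file_list, shard_id, num_shards):
    # Each process handles every num_shards-th file, offset by its shard id.
    for i, wav_path in enumerate(file_list):
        if i % num_shards == shard_id:
            extract_and_save(wav_path)

if __name__ == "__main__":
    # Usage: python extract_shard.py <shard_id> <num_shards>
    shard_id, num_shards = int(sys.argv[1]), int(sys.argv[2])
    with open("audioset_2m_wav_list.txt") as f:  # assumed list of wav paths
        files = [line.strip() for line in f if line.strip()]
    run_shard(files, shard_id, num_shards)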

Audio-Visual Embedding Analysis (Figure 3 of the UAVM Paper)

This part of the code is in src/embedding/.

This reproduces Figure 3 of the UAVM paper, which analyzes the embeddings of the unified models.

  1. Generate the intermediate representations of a model using gen_intermediate_representation.py. (This script does more than that, but you can ignore the other functions; we have also released these intermediate representations in case you don't want to generate them yourself, see below.)
  2. Build a modality classifier and record the results using check_modality_classification_summary.py, which generates modality_cla_summary_65_all.csv (65 is just an internal experiment id and can be ignored; we have included this file in the repo). A probing sketch follows this list.
  3. Plot the modality classification results using plt_figure_2_abc.py (Figure 2 (a), (b), and (c) of the UAVM paper).
  4. Plot the t-SNE results using plt_figure_2_d.py (Figure 2 (d) of the UAVM paper).
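As a rough illustration of the probing step (not the official check_modality_classification_summary.py, whose classifier and protocol may differ), the sketch below trains a logistic-regression probe to tell audio embeddings from visual embeddings taken from one layer; chance-level accuracy (0.5) indicates a modality-agnostic layer. The (N, dim) representation arrays are assumed to be pre-extracted.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def modality_probe_accuracy(audio_reps, video_reps):
    # Label audio embeddings 0 and visual embeddings 1, then fit a linear probe.
    X = np.concatenate([audio_reps, video_reps], axis=0)
    y = np.concatenate([np.zeros(len(audio_reps)), np.ones(len(video_reps))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=0, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)  # accuracy near 0.5 => layer is modality-agnostic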

Note: you can of course train your own models using our provided scripts and run the analysis on them. For easier reproduction and analysis without re-training, we also release all models and their corresponding intermediate representations [here] (this download is large, around 30GB).

Audio-Visual Retrieval Experiments (Figure 4 of the UAVM Paper)

This part of the code is in src/retrieval/.

  1. Generate the retrieval results (R@5 and R@10) using gen_rep_vggsound_retrieval2_r5_summary.py and gen_rep_vggsound_retrieval2_r10_summary.py, which output retrieval_summary_final_r5.csv and retrieval_summary_final_r10.csv, respectively. A recall@k sketch follows this list.

  2. Plot the figure using plt_retrieval.py, which reads the results from retrieval_summary_final_r5.csv and retrieval_summary_final_r10.csv.
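As a rough illustration of what R@5 and R@10 measure here (not necessarily the exact protocol of the gen_rep_vggsound_retrieval2_* scripts), the sketch below computes audio-to-visual recall@k from paired clip embeddings using cosine similarity; the (N, dim) embedding tensors, where row i of each comes from the same clip, are assumed inputs.

import torch
import torch.nn.functional as F

def recall_at_k(audio_emb, video_emb, k=5):
    # Cosine similarity between every audio clip and every video clip.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    sim = a @ v.t()                         # (N, N) similarity matrix
    topk = sim.topk(k, dim=-1).indices      # top-k retrieved video indices per audio
    targets = torch.arange(sim.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()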

Note: You can download all pretrained models needed to reproduce this result [here].

Attention Map Analysis (Figure 5 of the UAVM Paper)

This part of the code is in src/attention_diff/.

This shows the audio/visual attention maps and the audio-visual attention difference (Figure 5 of the UAVM paper).

  1. Generate the attention maps using gen_att_map_vggsound_summary.py (for the unified and modality-independent models) and gen_att_map_vggsound_full_att_summary.py (for the cross-modality attention model). gen_att_map_vggsound_summary.py also calculates the difference between the audio and visual attention maps; a sketch of this measurement follows the list.

  2. Plot the attention maps of the unified, modality-independent, and cross-modality attention models using plt_attmap_unified.py and plt_attmap_baseline.py.
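As a rough illustration of this measurement (not the official gen_att_map_vggsound_summary.py, whose exact definition may differ), the sketch below averages attention maps over samples and heads for each modality and reports the mean absolute difference between the two averaged maps; the (num_samples, num_heads, T, T) attention tensors are assumed to be collected beforehand.

import torch

def attention_difference(audio_att, video_att):
    # Average over samples and heads to get one (T, T) map per modality.
    a_map = audio_att.mean(dim=(0, 1))
    v_map = video_att.mean(dim=(0, 1))
    # Mean absolute difference between the averaged audio and visual maps.
    return (a_map - v_map).abs().mean().item()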

Note: you can of course train your own models using our provided scripts and run the analysis on them. For easier reproduction and analysis without re-training, we also release the pretrained models and attention maps [here].

Pretrained Models

Pretrained Audio Feature Extractor


AudioSet-2M pretrained ConvNext-Base

VGGSound pretrained ConvNext-Base

Pretrained Video Feature Extractor


We use the official torchvision ImageNet-pretrained ConvNext-Base as our visual feature extractor.
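A minimal sketch of loading this extractor with torchvision (>= 0.13) and using it for per-frame embeddings; the official extraction pipeline is in src/gen_video_feat/, and the pooling choice here is our assumption. Frames are expected as ImageNet-normalized 224x224 tensors.

import torch
import torchvision

# ImageNet-pretrained ConvNeXt-Base from torchvision.
model = torchvision.models.convnext_base(
    weights=torchvision.models.ConvNeXt_Base_Weights.IMAGENET1K_V1)
model.eval()

@torch.no_grad()
def extract_frame_features(frames):
    # frames: (B, 3, 224, 224), ImageNet-normalized RGB frames.
    feats = model.features(frames)   # (B, 1024, 7, 7) feature map
    feats = model.avgpool(feats)     # (B, 1024, 1, 1) global average pool
    return torch.flatten(feats, 1)   # (B, 1024) per-frame embeddings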

AudioSet Pretrained Model


UAVM Model

Modality-Independent Model

Cross-Modality Attention Model

VGGSound Pretrained Model


UAVM Model

Modality-Independent Model

Cross-Modality Attention Model

VGGSound Pretrained Audio and Video Features

Audio Features (~20GB)

Video Features (~10GB)

Contact

If you have a question, please open an issue (preferred) or send me an email at [email protected].


uavm's Issues

Video Only results on AudioSet-20K

Hi,

I can reproduce the finetuning results on AudioSet-20K in the multimodal setting.
However, when I try to get the video-only baseline from the multimodal pretrained weights, I can only get 13.11 (missingvideoonly setting) and 12.7 (videoonly).

A ViT-B-16 pretrained on ImageNet-21K with the multimodal script can get ~9.5 mAP.
Pretrained CAV-MAE with the audioonly script can get ~17.6 mAP.

Do you have a script for training the video-only baseline?

Thank you

uavm for variable seq_len data

I want to use my own data to train UAVM, but I hit an error when running this line in SingleTrans:

x = x + self.pos_embed

I noticed that you fix the seq_len in the bash script, so I assume the error is caused by a mismatch between my data's seq_len and the seq_len given in the bash script, and I cannot find a way to deal with it.
Hope to get some advice :)
PS: the get_sinusoid_encoding function in mm_trans.py seems unused, is that right?
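A generic workaround for this kind of shape mismatch (not an official fix from the authors, and it may not match how UAVM is meant to be used) is to truncate or interpolate the positional embedding to the input length before adding it. A rough sketch, assuming pos_embed has shape (1, max_len, dim) and x has shape (batch, seq_len, dim):

import torch
import torch.nn.functional as F

def add_pos_embed(x, pos_embed):
    # Truncate (or linearly interpolate) pos_embed to the actual input length.
    T = x.size(1)
    if T <= pos_embed.size(1):
        pe = pos_embed[:, :T, :]
    else:
        pe = F.interpolate(pos_embed.transpose(1, 2),  # (1, dim, max_len)
                           size=T, mode="linear",
                           align_corners=False).transpose(1, 2)
    return x + pe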

json example link not accessible

The sample links (“see samples at [here]”) in the VGGSound Recipe and AudioSet Recipe cannot be accessed.

Could you please tell me whether the json files here are in the same format as the json files in your other project, AST? I have run your AST code, and if the format is the same, can I just copy them over?
