This repository contains the implementation of *A Simple Baseline for Audio-Visual Scene-Aware Dialog*.
The code is based on Hori's naive baseline. We thank the AVSD team for the dataset and for sharing their implementation code.
- python 2.7
- pytorch 0.4.1
- numpy
- six
- java 1.8.0 (for coco-evaluation tools)
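A possible environment setup is sketched below. This is only a sketch, assuming `pip` and a Python 2.7 interpreter are available; the exact package source for old pytorch wheels may differ:

```shell
# Sketch: install the Python dependencies listed above.
# pytorch 0.4.1 wheels for python 2.7 may need to be fetched from the
# official pytorch archive rather than plain PyPI.
pip install torch==0.4.1 numpy six

# java 1.8 is needed only for the coco-evaluation tools, e.g. on Ubuntu:
# sudo apt-get install openjdk-8-jre
```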
We use the official AVSD v0.1 train set. For validation and evaluation we use the prototype val set and test set. See the DSTC7 AVSD challenge for more details. Please cite AVSD if you use their dataset.
Download the AVSD annotations from this link, and extract them to `data/`
Download the CHARADES audio/video features from this link, and extract them to `data/charades_features`
The script has 4 stages:
- stage 1 - preparation of dependent packages
- stage 2 - training
- stage 3 - generation of sentences on the test set
- stage 4 - evaluation of generated sentences
Use: `$ ./run --stage X`
to run the desired stage.
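As a sketch of how the stages chain together, the loop below runs all four in order. The `run_demo` stand-in script here is purely hypothetical and only echoes each stage; the repository's actual `run` script performs the work described above:

```shell
# A minimal stand-in for the run script, used only to illustrate
# the --stage interface (the real script does the actual work).
cat > ./run_demo <<'EOF'
#!/bin/sh
# $1 is expected to be --stage, $2 the stage number
case "$2" in
  1) echo "stage 1: preparing dependent packages" ;;
  2) echo "stage 2: training" ;;
  3) echo "stage 3: generating sentences on the test set" ;;
  4) echo "stage 4: evaluating generated sentences" ;;
  *) echo "usage: ./run_demo --stage <1-4>" >&2; exit 1 ;;
esac
EOF
chmod +x ./run_demo

# Run all four stages in sequence, stopping on the first failure.
for stage in 1 2 3 4; do
    ./run_demo --stage "$stage" || exit 1
done
```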
You can follow this link for the pretrained model.