simple-avsd's Introduction

Simple Baseline For Audio-Visual Scene-Aware Dialog

This repository is the implementation of A Simple Baseline for Audio-Visual Scene-Aware Dialog.

The code is based on Hori’s naive baseline. We thank the AVSD team for the dataset and for sharing their implementation code.

Required packages

  • python 2.7
  • pytorch 0.4.1
  • numpy
  • six
  • java 1.8.0 (for coco-evaluation tools)

Data

We use the official AVSD v0.1 train-set. For validation and evaluation we use the prototype val-set and test-set. See the DSTC7 AVSD challenge for more details. Please cite AVSD if you use their dataset.

Download AVSD annotations from this link, and extract to ‘data/’

Download CHARADES audio-video related features from this link, and extract to ‘data/charades_features’
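Before running the script, it can help to confirm that both archives were extracted where the instructions above say. The following is a minimal sketch, not part of the repository: the helper name missing_data_dirs is hypothetical, and the only paths it checks are the two directories named above. The demo runs against a throwaway temporary directory.

```python
import os
import shutil
import tempfile


def missing_data_dirs(root="."):
    """Return the expected data directories that do not exist under root."""
    expected = ["data", os.path.join("data", "charades_features")]
    return [d for d in expected if not os.path.isdir(os.path.join(root, d))]


# Demo on a temporary directory so nothing in the working tree is touched.
tmp = tempfile.mkdtemp()
try:
    before = missing_data_dirs(tmp)  # nothing extracted yet
    os.makedirs(os.path.join(tmp, "data", "charades_features"))
    after = missing_data_dirs(tmp)   # both directories now present
finally:
    shutil.rmtree(tmp)

print(before)
print(after)
```

In a real checkout, calling missing_data_dirs(".") from the repository root would list whichever of the two directories still needs to be created.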

Run

The script has four stages:

  • stage 1 - preparation of dependent packages
  • stage 2 - training
  • stage 3 - generation of sentences on test-set
  • stage 4 - evaluation of generated sentences

Use: $ ./run --stage X to run the desired stage.

You can follow this link for pretrained model.

simple-avsd's People

Contributors

idansc

simple-avsd's Issues

Dimensions clarification

Hi,
I have two questions regarding dimensions,
1. Let's focus on the Unary unit as an example. As far as I can see from the paper, the input for this unit is of the shape : where is the embedding dimension (i.e. for image or for text) and is the number of samples. After doing the mathematics described in the paper, we get an output of the shape : which confuses me a little. Am I missing something? Otherwise, how can we look at the importance of each feature for each sample? (Here we get only one sample output.)

2. I tried to run the following code, which simply adds unary attention on word embeddings of dimension 300, and I noticed that the Atten class expects a 3-dimensional input tensor (I assume the first dimension is the batch dimension; correct me if I'm missing something):
s_output, s_lengths = nn.utils.rnn.pad_packed_sequence(s)
print(s_output.transpose(1,2).size())  # This will print: torch.Size([48, 300, 128])
s_with_atten = self.attention(utils=[s_output.transpose(1,2).cuda()])
print(s_with_atten[0].size())  # This will print: torch.Size([48, 300])
I defined self.attention as follows:
self.attention = Atten(util_e=[params.word_embedding_dim], high_order_utils=[], pairwise_flag=False, unary_flag=True, self_flag=False)
Can you elaborate more on the output dimensions printed?
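One way to see how a 3-dimensional input can produce a 2-dimensional output is a generic attention sketch in NumPy. This is not the repository's Atten class. It assumes the (batch, dim, n) layout that the question assumes for the printed shapes, and the scoring vector w is hypothetical. Under those assumptions, the importance of each element lives in the attention weights of shape (batch, n), while the returned tensor is the weighted sum over elements, of shape (batch, dim).

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def unary_attention(utility, w):
    """Attend over the n elements of a single utility.

    utility: array of shape (batch, dim, n) (assumed layout)
    w:       hypothetical scoring vector of shape (dim,)
    returns: (attended, weights) with shapes (batch, dim) and (batch, n)
    """
    # One scalar score per element: (batch, dim, n) -> (batch, n)
    scores = np.einsum("bdn,d->bn", utility, w)
    # Normalize the scores over the n elements of each sample.
    weights = softmax(scores, axis=1)
    # The weighted sum over elements collapses the n axis: -> (batch, dim)
    attended = np.einsum("bdn,bn->bd", utility, weights)
    return attended, weights


rng = np.random.RandomState(0)
x = rng.randn(48, 300, 128)  # same shape as the tensor printed above
w = rng.randn(300)
attended, weights = unary_attention(x, w)
print(attended.shape)
print(weights.shape)
```

Note that nn.utils.rnn.pad_packed_sequence returns a time-major tensor of shape (T, B, *) unless batch_first=True is passed, so it is worth checking which axis actually holds the batch in the snippet from the question.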

wrong variable value

prior_flag=True, sizes=[10, 49, 49, 49, 49, 10], size_flag=False, pairwise_flag=True, unary_flag=True, self_flag=True)

Shouldn't size_flag=False be set to True instead?
You already pass a size list to Atten. Inside the constructor, there is code that initializes an empty size list when size_flag is False, but since a size list was already passed there is no need to initialize a new one filled with None:

sizes = [None for _ in util_e]
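The behavior described in this issue can be reproduced in isolation. The sketch below is a hypothetical stand-in for the relevant constructor logic, written only from the description above and not copied from the repository. The helper name resolve_sizes and the util_e values are assumptions for illustration.

```python
def resolve_sizes(util_e, sizes=None, size_flag=True):
    """Hypothetical helper mirroring the size handling described in the issue."""
    if sizes is None:
        sizes = []
    if not size_flag:
        # As described above: any list the caller passed is replaced
        # by one None placeholder per utility.
        sizes = [None for _ in util_e]
    return sizes


util_e = [128, 128, 128, 128, 128, 128]  # hypothetical embedding sizes
passed = [10, 49, 49, 49, 49, 10]        # size list from the call above

ignored = resolve_sizes(util_e, passed, size_flag=False)
kept = resolve_sizes(util_e, passed, size_flag=True)
print(ignored)
print(kept)
```

If the constructor behaves as the issue describes, the passed sizes only take effect when size_flag is True, which is the point the question raises.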

Question about implementation

Hi,
I have a question about the following line of code :

self.pp_models[str(idx1)] = Pairwise(e_dim_1, sizes[idx1])

Assuming we set pairwise_flag=False and unary_flag=True, as far as I can see from the code this still adds a pairwise potential between the passed utility and itself, in addition to the unary potential for the same utility.
My question is: what is the point of adding a pairwise potential between the utility and itself when pairwise_flag=False and unary_flag=True?
