
Comparing Image Description Measures

Estimate the correlation of different text-based evaluation measures for Automatic Image Description. This is the code used in Elliott and Keller (2014). We repackage the descriptions and judgements from Hodosh et al. (2013), and include several other packages to ease calculation of the coefficients. The code calculates the coefficients for BLEU4, TER, Meteor, and ROUGE-SU4 on the Flickr8K corpus, but it can be extended to any sentence-level evaluation measure and any corpus.

Depends on: Python, MultEval (for Smoothed BLEU4, TER, and Meteor), ROUGE (for ROUGE-SU4), and R with Rscript (for the correlation analysis).

Comparison of different measures

We used the Flickr8K data set with its expert human judgements, and tokenised the text with the Stanford CoreNLP PennTreebank tokenizer, applying the COCO custom punctuation removals. Note that this table differs slightly from the one in our ACL paper because here we tokenised the text and include the recently proposed CIDEr measure.

Measure       Spearman's rho
CIDEr          0.581
Meteor         0.560
BLEU4          0.459
ROUGE-SU4      0.440
TER           -0.290

Assumptions

You have candidate descriptions $cand, multiple reference descriptions $ref-1, $ref-2, ..., and human judgements, all stored in a common directory $data.
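
As a quick sanity check, the candidate file and every reference file should line up, one description per line. A minimal Python sketch, under the assumption that the candidates file is named candidates and the reference files match ref-* inside $data (the file names follow the commands below, but your layout may differ):

    import glob
    import os

    data = "data"  # hypothetical path to your $data directory

    def count_lines(path):
        with open(path, encoding="utf-8") as f:
            return sum(1 for _ in f)

    n_cand = count_lines(os.path.join(data, "candidates"))
    for ref in sorted(glob.glob(os.path.join(data, "ref-*"))):
        # each reference file should pair one description with each candidate
        assert count_lines(ref) == n_cand, f"{ref} does not match the candidates"
    print(f"{n_cand} candidates; all reference files align")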

Pre-processing descriptions

In our ACL'14 paper, we performed no tokenisation. Performing the steps below is likely to increase the correlation coefficients, but it may lead to discrepancies when reproducing our published results.

  1. (Optional) Tokenise the text. Run python ptbtokenizer $file in the tokenizer directory, where $file is the path to the original file. We use the exact tokenizer available in the Stanford CoreNLP package.

  2. Split the text files into one file per description. Run python prepareROUGECandidates.py $cand to split the candidates into individual files, and run python prepareROUGEReferences.py $ref{1,2,...} to split the references into individual files. This is required for the ROUGE-SU4 calculations; a sketch of both pre-processing steps appears after this list.
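
A minimal sketch of both pre-processing steps, with two flagged assumptions: NLTK's TreebankWordTokenizer stands in for the Stanford CoreNLP PTB tokenizer that the repository actually uses, and the per-description file names (cand-<i>.txt) are hypothetical rather than the names prepareROUGECandidates.py produces:

    import os
    from nltk.tokenize import TreebankWordTokenizer  # approximates the Stanford PTB tokenizer

    tokenizer = TreebankWordTokenizer()

    def tokenise_file(path):
        """Return one tokenised description per input line."""
        with open(path, encoding="utf-8") as f:
            return [" ".join(tokenizer.tokenize(line.strip())) for line in f]

    def split_per_description(lines, out_dir, prefix="cand"):
        """Write each description to its own file, as the ROUGE scripts expect."""
        os.makedirs(out_dir, exist_ok=True)
        for i, line in enumerate(lines):
            out = os.path.join(out_dir, f"{prefix}-{i}.txt")  # hypothetical naming scheme
            with open(out, "w", encoding="utf-8") as f:
                f.write(line + "\n")

    split_per_description(tokenise_file("data/candidates"), "data/perline-candidates")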

Calculating text-based measures

  1. Calculate the Smoothed BLEU4, TER, and Meteor scores. Run ./multeval.sh eval --refs ../$data/$ref-* --hyps-baseline ../$data/candidates --meteor.language en --sentLevelDir ../$data/sentLevel from the multeval directory. Note that the multeval.sh script downloads the METEOR paraphrase dictionaries on the first run, which may take a while. An illustrative sentence-level BLEU4 sketch appears after this list.

  2. Calculate the ROUGE-SU4 scores. Run ./rougeScores.sh $data/perline-candidates $data/perline-ref-0 5 from the ROUGE directory.
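
For intuition, here is an illustrative sentence-level smoothed BLEU4 computation using NLTK. It approximates, but will not exactly reproduce, the smoothed BLEU4 scores that MultEval writes to $data/sentLevel; the example sentences are made up:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    smooth = SmoothingFunction().method3  # one common smoothing choice, not MultEval's exact variant

    candidate = "a brown dog is running through the grass".split()
    references = [
        "a brown dog runs through the grass".split(),
        "the dog is running on the green grass".split(),
    ]

    # BLEU4: uniform weights over 1- to 4-grams
    score = sentence_bleu(references, candidate,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=smooth)
    print(f"Smoothed BLEU4: {score:.3f}")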

Calculating correlation co-efficients

  1. Extract the Smoothed BLEU4, TER, and Meteor scores from the sys1.opt1 file. Run python extractMultEvalScores.py $data/sentLevel/sys1.opt1, where $data is the path to the descriptions data.

  2. Extract the ROUGE-SU4 scores from the ROUGE_result directory. Run python extractROUGEScores.py. This extracts the ROUGE-SU4 recall scores for each candidate into ROUGE_result/rougeScores.

  3. Append the ROUGE scores to the MultEval scores. Run python appendNewScores.py $newScoresFile $label $data/sentLevel/sys1.opt1, where $newScoresFile contains one number per line for each candidate sentence, and $label is a human-readable label for the number.

  4. Calculate the correlations and plot the score distributions. Run Rscript calculateCorrelation.R $scoresFile $judgementsFile.
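
The core of calculateCorrelation.R is a Spearman rank correlation between the per-sentence measure scores and the human judgements. An equivalent Python sketch, with assumed file formats (one value per line, aligned by sentence) and hypothetical file names:

    from scipy.stats import spearmanr

    def read_scores(path):
        with open(path, encoding="utf-8") as f:
            return [float(line.strip()) for line in f]

    measure = read_scores("data/sentLevel/meteor-scores")  # hypothetical file name
    judgements = read_scores("data/judgements")            # hypothetical file name

    rho, p = spearmanr(measure, judgements)
    print(f"Spearman's rho = {rho:.3f} (p = {p:.3g})")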

Visualising the distribution of scores

Visualising the probability density estimates for evaluation measures can offer additional insight into the differences between measures. In the density visualisations below, we can see that BLEU4 struggles to separate three of the four classes of human judgements. Meteor does a much better job of separating descriptions that are Not Related from descriptions that have Minor Mistakes.
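
The repository produces these plots via the R script above; the following Python sketch reproduces the idea with Gaussian kernel density estimates over synthetic scores. The class labels other than Not Related and Minor Mistakes, and the score distributions themselves, are placeholders:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(0)
    # synthetic measure scores grouped by human judgement class (placeholders)
    scores_by_class = {
        "Not Related": rng.beta(2, 8, 200),
        "Minor Mistakes": rng.beta(5, 5, 200),
        "Good": rng.beta(8, 3, 200),
    }

    xs = np.linspace(0, 1, 200)
    for label, scores in scores_by_class.items():
        plt.plot(xs, gaussian_kde(scores)(xs), label=label)  # one density curve per class
    plt.xlabel("Measure score")
    plt.ylabel("Density")
    plt.legend()
    plt.show()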
