
EvalEval

This repository contains the code for the paper Perturbation CheckLists for Evaluating NLG Evaluation Metrics, published at EMNLP 2021.

Authors: Ananya B. Sai, Tanay Dixit, Dev Yashpal Sheth, Sreyas Mohan and Mitesh M. Khapra.

Webpage: https://iitmnlp.github.io/EvalEval/

Contents

  • Overview
  • Setup
  • Templates
  • Human Evaluations
  • Metrics
  • Citation

Overview

In this work, we provide a detailed analysis of NLG evaluation metrics that goes beyond correlation with human scores. We propose a comprehensive, criteria-checklist-based evaluation that acts as a diagnostic tool, pointing out specific avenues of improvement for a metric. We create templates specifically targeted at evaluating a metric's ability to capture a particular dimension: for example, a fluency-targeted template might jumble the word order of a reference sentence, and a metric that captures fluency should score the perturbed sentence lower.
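
As a toy illustration of the idea (not one of the actual templates from this repository), the sketch below applies a simple word-order perturbation targeting fluency:

import random

def jumble(sentence, seed=0):
    """Toy fluency perturbation: shuffle the word order of a sentence."""
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

reference = "Tom went to play in the garden"
perturbed = jumble(reference)
# A metric that captures fluency should score `perturbed` lower than `reference`.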

Please find more details of this work in our paper.

Setup

Install Dependencies

Our code is based on Python 3.7. To install all the dependencies, run the following command:

pip install -r requirements.txt

Load the data

All the original datasets used in our experiments can be downloaded directly by running the following commands:

cd data
bash download.sh

To use a custom dataset, follow one of the formats below, or feel free to modify the code to make it compatible.

JSONL format

{"id": 0, "references": "Tom went to play in the garden", ...}
{"id": 1, "references": "It will rain today", ...}
...

CSV format

id, references, ...
0, Tom went to play in the garden, ...
1, It will rain today, ...

Note: DG (Dialogue Generation) follows a different input format from the rest; it expects a CSV file (see the Dialogue Generation command below).
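
For reference, here is a minimal loader sketch for the two formats above (the load_examples helper is illustrative and not part of this repository's code):

import json
import pandas as pd

def load_examples(path):
    """Load records with 'id' and 'references' fields from a .jsonl or .csv file."""
    if path.endswith(".jsonl"):
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    if path.endswith(".csv"):
        # pandas takes care of quoting and escaping inside the reference sentences
        return pd.read_csv(path, skipinitialspace=True).to_dict(orient="records")
    raise ValueError(f"Unsupported file format: {path}")

examples = load_examples("data/example.jsonl")
print(examples[0]["id"], examples[0]["references"])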

Templates

All the templates used in our work are available in the templates/ folder, categorized into the sections below.

Each task is evaluated along the following criteria; the table below can also be found in our paper.

Task                              Criteria
Machine Translation (MT)          Fluency, Adequacy
Abstractive Summarization (AS)    Fluency, Coherence, Relevance, Coverage, Clarity
Image Captioning (IC)             Fluency, Thoroughness, Correctness
Data-to-Text Generation (D2T)     Fluency, Correctness, Coverage, Relevance
Question Generation (QG)          Fluency, Answerability, Relevance
Dialogue Generation (DG)          Fluency, Relevance, Making sense, Interesting, Avoid repetition

Each template saves the perturbed sentences, along with the originals, in the outputs folder. To test a metric's performance on these, pass the reference and perturbed sentences to the metric and compare the aggregated metric score over the entire dataset with the annotation score given for each template. More details can be found in the Metrics section; a sketch of this comparison follows below.
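
As an illustration, here is a minimal sketch of that comparison, assuming a generic sentence-level metric (metric_score below is a placeholder for whichever metric you are diagnosing, not a function from this repository):

import numpy as np

def metric_score(reference, hypothesis):
    # Placeholder: plug in the metric under test (BLEU, BERTScore, etc.).
    raise NotImplementedError

def aggregated_scores(references, originals, perturbed):
    """Mean metric score over the dataset, before and after perturbation."""
    orig = np.mean([metric_score(r, o) for r, o in zip(references, originals)])
    pert = np.mean([metric_score(r, p) for r, p in zip(references, perturbed)])
    return orig, pert

# A metric sensitive to the targeted criterion should yield pert < orig;
# the size of the drop can then be compared with the template's annotation score.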

Data-to-Text Generation

To run the perturbations, use the following command:

python3 main.py \
        --task D2T  \
        --ref_file data/<data.jsonl> \
        --output_file example \
        --criteria <all/Fluency/Invariance/Coverage/Relevance>

Image Captioning

To run the perturbations, use the following command:

python3 main.py \
        --task IC  \
        --ref_file data/<data.jsonl> \
        --output_file example \
        --criteria <all/Fluency/Invariance/Completeness/Throughness>

Machine Translation

To run the perturbations, use the following command:

python3 main.py \
        --task MT  \
        --ref_file data/<data.jsonl> \
        --output_file example \
        --criteria <all/Fluency/Invariance/Adequacy>

Dialogue Generation

To run the perturbations, use the following command:

python3 main.py \
        --task DG  \
        --ref_file data/<data.csv> \
        --output_file example \
        --criteria <all/Fluency/Invariance/Avoid-repetition/Making-sense>

Abstractive Summarization

To run the perturbations, use the following command:

python3 main.py \
        --task AS  \
        --ref_file data/<data.jsonl> \
        --output_file example \
        --criteria <all/Fluency/Invariance/Coverage/Relevance/Clarity>

Question Generation

To run the perturbations, use the following command:

python3 main.py \
        --task QG  \
        --ref_file data/<data.jsonl> \
        --output_file example \
        --criteria <all/Fluency/Invariance/Answerability>

Human Evaluations

The human annotations collected for the templates can be downloaded from here.

Metrics

Coming soon...

Citation

@InProceedings{Sai_2021_EMNLP,
    author = {Sai, Ananya B. and Dixit, Tanay and Sheth, Dev Yashpal and Mohan, Sreyas and Khapra, Mitesh M.},
    title = {Perturbation CheckLists for Evaluating NLG Evaluation Metrics},
    booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    month = {November},
    year = {2021}
}
