
7404_Project_LLM_summeval

Demo for the paper: Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization

Original repository:

https://github.com/DAMO-NLP-SG/LLM_summeval.git

Authors: Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You and Lidong Bing

This repository contains code and related resources of the paper "Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization".

@inproceedings{shen2023llmeval,
  title={Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization},
  author={Shen, Chenhui and Cheng, Liying and Nguyen, Xuan-Phi and Bing, Lidong and You, Yang},
  booktitle={Findings of EMNLP},
  url={"https://arxiv.org/abs/2305.13091"},
  year={2023}
}

1. Introduction:

Pre-trained language models (PLMs) have achieved impressive results in abstractive single-document summarization (SDS). However, such benefits may not readily extend to multi-document summarization (MDS), where the interactions among documents are more complex. Previous works either design new architectures or new pre-training objectives for MDS, or apply PLMs to MDS without considering the complex document interactions. While the former does not make full use of previous pre-training efforts and may not generalize well across multiple domains, the latter cannot fully attend to the intricate relationships unique to MDS tasks. In this paper, we enforce hierarchy on both the encoder and decoder and seek to make better use of a PLM to facilitate multi-document interactions for the MDS task. We test our design on 10 MDS datasets across a wide range of domains. Extensive experiments show that our proposed method achieves consistent improvements on all these datasets, outperforming the previous best models and even achieving results better than or competitive with models that use additional MDS pre-training or more parameters.


2. File Contents

Under Root Dir,

  • model_output_annotations/: our processed SummEval annotations for the abstractive summarization systems.

  • eval_model_generations/: the outputs of LLM evaluations using RTS, MCQ, or alternative prompts, stored under a directory named after the evaluation model, with the evaluation method as the filename suffix (e.g., _rts.json, _mcq.json, etc.)

  • comp_data/: our processed head-to-head comparison inputs from models in the SummEval dataset.

  • comp_res/: the output of LLM evaluations using H2H prompts.

  • summeval.json: the annotation file taken from the original SummEval repository.

  • eval_with_rts_or_mcq.py: call ChatGPT or GPT-4 to evaluate with RTS or MCQ prompts; to run it, add your own OpenAI API key in secret.py (see Section 3.2).

  • extract_model_scores.py: extract all the LLM-evaluated scores for a specific model, stored under model_output_annotations.

  • calc_data_corr.py: to calculate correlation using the full 1200 summaries (results in Table 5) for a given evaluator model; see the correlation sketch after this list.

  • per_model_corr.py: to calculate correlation for each candidate model.

  • calc_meta_corr.py: to calculate meta-correlation for a given evaluator model.
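
The three correlation scripts rely on scipy. As a reference point, here is a minimal sketch of the kind of sample-level correlation they report, with hypothetical score lists standing in for the ratings the scripts actually load:

# Minimal sketch of sample-level correlation with scipy. The score lists
# are hypothetical stand-ins for the LLM and human ratings loaded from disk.
from scipy.stats import kendalltau, pearsonr, spearmanr

llm_scores = [4.0, 3.5, 2.0, 5.0, 3.0]    # hypothetical LLM ratings
human_scores = [4.2, 3.0, 2.5, 4.8, 3.1]  # hypothetical human ratings

rho, _ = spearmanr(llm_scores, human_scores)
tau, _ = kendalltau(llm_scores, human_scores)
r, _ = pearsonr(llm_scores, human_scores)
print(f"spearman={rho:.3f}  kendall={tau:.3f}  pearson={r:.3f}")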

3. Running the code

3.1. Set up the environment

git clone https://github.com/010JIN/7404_Project_LLM_summeval.git
cd 7404_Project_LLM_summeval
# The code uses the legacy OpenAI client, so pin openai to version 0.28.
pip install openai==0.28 tqdm scipy

3.2. Set the OpenAI key:

Create a file named secret.py containing your OpenAI API key in the following format:

# Set your OpenAI API key, refer: https://openai.com/index/openai-api/
my_key = 'Enter your openai api key'
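
For reference, here is a rough sketch of how the scripts presumably consume this key with the legacy openai==0.28 client; the exact wiring inside eval_with_rts_or_mcq.py may differ:

# Rough sketch (openai==0.28 legacy client); the actual wiring may differ.
import openai
from secret import my_key

openai.api_key = my_key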

3.3. Demo:

This demo uses only gpt-3.5-turbo-0301 as the evaluation model, since our OpenAI API quota is limited. GPT-4 and other model versions may also be unavailable depending on regional access restrictions.

For RTS or MCQ

  • Step 1: Get RTS or MCQ responses from the OpenAI API
# Set print_full_prompt_without_calling_api = True for the demo, in case the OpenAI connection fails.
# For dim, 0 is relevance, 1 is consistency, 2 is fluency, and 3 is coherence;
# For eval_type, 0 is RTS, 1 is MCQ, 2 is StarEval.
# Usage: python eval_with_rts_or_mcq.py --eval_model <openai model> --dim <int from 0 to 3> --eval_type <int from 0 to 2>
python eval_with_rts_or_mcq.py --eval_model gpt-3.5-turbo-0301 --dim 0 --eval_type 0 
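
For context, an RTS-style request with the legacy openai==0.28 client looks roughly like the sketch below. The prompt wording, temperature, and response handling are illustrative assumptions, not the script's exact code:

# Rough sketch of an RTS-style call (openai==0.28 legacy client).
# The prompt text is illustrative, not the paper's exact wording.
import openai
from secret import my_key

openai.api_key = my_key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[{
        "role": "user",
        "content": "Rate the relevance of the summary to the article "
                   "on a scale of 1 to 5.\n\nArticle: ...\n\nSummary: ...",
    }],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])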
  • Step 2: Compile a new data file with all metric results
python extract_model_scores.py --eval_model gpt-3.5-turbo-0301

Partial output is shown below:

[screenshot: partial output of extract_model_scores.py]
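
Conceptually, Step 2 gathers the per-dimension response files under eval_model_generations/ into a single score table. Below is a toy sketch of that kind of merge; the glob pattern and JSON schema (summary id mapped to a numeric score) are assumptions for illustration, not extract_model_scores.py's actual format:

# Toy sketch: merge per-dimension LLM responses into one score table.
# The file pattern and JSON schema here are assumptions for illustration.
import glob
import json

scores = {}
for path in glob.glob("eval_model_generations/gpt-3.5-turbo-0301/*_rts.json"):
    with open(path) as f:
        data = json.load(f)  # assumed schema: {summary_id: numeric_score}
    for summary_id, score in data.items():
        scores.setdefault(summary_id, {})[path] = score

print(f"collected scores for {len(scores)} summaries")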

  • Step 3.1: Calculate correlation for all 1200 summaries
python calc_data_corr.py --eval_model gpt-3.5-turbo-0301 --eval_type 0

The output is shown below:

[screenshot: correlation results over all 1200 summaries]

  • Step 3.2: Calculate correlation for each candidate model
python per_model_corr.py --eval_model gpt-3.5-turbo-0301 --eval_type 0

The output is shown below:

[screenshot: per-candidate-model correlation results]

  • Step 3.3: Calculate meta-correlation
python calc_meta_corr.py --eval_model gpt-3.5-turbo-0301

The output is shown below:

[screenshot: meta-correlation results]
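
As we read the paper, meta-correlation measures how the evaluator's effectiveness varies with summarizer capability: for each candidate system, take the mean human rating (capability) and the evaluator's sample-level correlation on that system's summaries (effectiveness), then correlate the two across systems. A hedged sketch with toy numbers, assuming this reading of the definition; calc_meta_corr.py computes the real values from the per-model results:

# Hedged sketch of meta-correlation, assuming the definition above: Pearson
# correlation between per-system quality (mean human rating) and per-system
# evaluator effectiveness (sample-level correlation). Toy data only.
from scipy.stats import pearsonr, spearmanr

def meta_correlation(per_system):
    """per_system: list of (human_scores, llm_scores), one pair per system."""
    quality, effectiveness = [], []
    for human, llm in per_system:
        quality.append(sum(human) / len(human))         # system capability
        effectiveness.append(spearmanr(llm, human)[0])  # evaluator correlation
    return pearsonr(quality, effectiveness)[0]

toy = [
    ([2.1, 2.8, 2.4, 2.0], [2.0, 3.0, 2.2, 2.4]),  # weaker system
    ([3.3, 3.6, 3.1, 3.4], [3.0, 3.2, 3.5, 3.1]),  # mid system
    ([4.5, 4.7, 4.2, 4.6], [3.9, 3.5, 4.4, 4.0]),  # stronger system
]
print(f"meta-correlation: {meta_correlation(toy):.3f}")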
