Demo for paper: Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization
https://github.com/DAMO-NLP-SG/LLM_summeval.git
Authors: Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You and Lidong Bing
This repository contains code and related resources of the paper "Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization".
@inproceedings{shen2023llmeval,
title={Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization},
author={Shen, Chenhui and Cheng, Liying and Nguyen, Xuan-Phi and Bing, Lidong and You, Yang},
booktitle={Findings of EMNLP},
url={https://arxiv.org/abs/2305.13091},
year={2023}
}
Pre-trained language models (PLMs) have accomplished impressive achievements in abstractive single-document summarization (SDS). However, such benefits may not be readily extended to multi-document summarization (MDS), where the interactions among documents are more complex. Previous works either design new architectures or new pre-training objectives for MDS, or apply PLMs to MDS without considering the complex document interactions. While the former does not make full use of previous pre-training efforts and may not generalize well across multiple domains, the latter cannot fully attend to the intricate relationships unique to MDS tasks. In this paper, we enforce hierarchy on both the encoder and decoder and seek to make better use of a PLM to facilitate multi-document interactions for the MDS task. We test our design on 10 MDS datasets across a wide range of domains. Extensive experiments show that our proposed method can achieve consistent improvements on all these datasets, outperforming the previous best models, and even achieving better or competitive results as compared to some models with additional MDS pre-training or larger model parameters.
Under the root directory:
- `model_output_annotations/`: our processed SummEval annotations for the abstractive summarization systems.
- `eval_model_generations/`: the outputs of LLM evaluations using RTS, MCQ, or alternative prompts, stored under a directory named after each evaluation model, with the evaluation method as the file name suffix (e.g., `_rts.json`, `_mcq.json`, etc.).
- `comp_data/`: our processed head-to-head comparison inputs from models in the SummEval dataset.
- `comp_res/`: the outputs of LLM evaluations using H2H prompts.
- `summeval.json`: the same file taken from this repo.
- `eval_with_rts_or_mcq.py`: calls ChatGPT or GPT-4 to evaluate with RTS or MCQ prompts; to run it, add your own OpenAI API key in `secret.py`.
- `extract_model_scores.py`: extracts all the LLM-evaluated scores for a specific model stored under `model_output_annotations`.
- `calc_data_corr.py`: calculates correlation using the full 1200 summaries (results in Table 5) for a given evaluator model.
- `per_model_corr.py`: calculates correlation for each candidate model.
- `calc_meta_corr.py`: calculates meta-correlation for a given evaluator model.
git clone https://github.com/010JIN/7404_Project_LLM_summeval.git
# Pin the openai package to version 0.28.
pip install openai==0.28 tqdm scipy
Create a file named secret.py and set your OpenAI key in the following format:
# Set your OpenAI key; see https://openai.com/index/openai-api/
my_key = 'Enter your openai api key'
This demo uses only gpt-3.5-turbo-0301 as the evaluation model because GPU resources and OpenAI API tokens are limited. GPT-4 and other model versions are sometimes inaccessible, probably due to country restrictions on the service.
For RTS or MCQ
- Step 1: Get RTS or MCQ response from openai APIs
# Set print_full_prompt_without_calling_api = True for the demo in case the connection to OpenAI fails.
# For dim, 0 is relevance, 1 is consistency, 2 is fluency, and 3 is coherence;
# For eval_type, 0 is RTS, 1 is MCQ, 2 is StarEval.
# Usage: python eval_with_rts_or_mcq.py --eval_model <openai model> --dim <int from 0 to 3> --eval_type <int from 0 to 2>
python eval_with_rts_or_mcq.py --eval_model gpt-3.5-turbo-0301 --dim 0 --eval_type 0
![image](https://private-user-images.githubusercontent.com/105320955/346556777-98339e9a-d420-4580-8105-edf9a195feb1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIxNjk5NTAsIm5iZiI6MTcyMjE2OTY1MCwicGF0aCI6Ii8xMDUzMjA5NTUvMzQ2NTU2Nzc3LTk4MzM5ZTlhLWQ0MjAtNDU4MC04MTA1LWVkZjlhMTk1ZmViMS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI4JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyOFQxMjI3MzBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1hYzcxNGI2ZDhmNDlhMTMyYjg1Y2EwMmExNTVhNmY3MTFmNDhjMjUwZjE2MWVlNzY1MzVjNDY3Yzk4MGQ2MmVjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.gzDiJYSMAmqg9EHUp-ll4SDUBoEeuSxOTwazasByFgQ)
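The integer flags above can be hard to remember, so here is a minimal sketch of how they map to names. The flag names and the index order come from the comments in Step 1; the `DIMENSIONS` and `EVAL_TYPES` lists and the `describe` helper are hypothetical and are not taken from `eval_with_rts_or_mcq.py`.

```python
import argparse

# Index order follows the comments in Step 1 (assumed to match the script).
DIMENSIONS = ["relevance", "consistency", "fluency", "coherence"]
EVAL_TYPES = ["RTS", "MCQ", "StarEval"]


def parse_args(argv=None):
    """Parse the three flags used in Step 1 and reject out-of-range indices."""
    parser = argparse.ArgumentParser(description="Sketch of the Step 1 flags")
    parser.add_argument("--eval_model", default="gpt-3.5-turbo-0301")
    parser.add_argument("--dim", type=int, default=0,
                        choices=range(len(DIMENSIONS)))
    parser.add_argument("--eval_type", type=int, default=0,
                        choices=range(len(EVAL_TYPES)))
    return parser.parse_args(argv)


def describe(args):
    """Return a readable summary of the selected configuration."""
    return f"{args.eval_model}: {EVAL_TYPES[args.eval_type]} on {DIMENSIONS[args.dim]}"


# Hypothetical usage: MCQ prompts on the fluency dimension.
print(describe(parse_args(["--dim", "2", "--eval_type", "1"])))
# gpt-3.5-turbo-0301: MCQ on fluency
```

Restricting `choices` to the valid index range makes an invalid value such as `--dim 4` fail at parse time rather than later in the run.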
- Step 2: Compile a new data file with all metric results
python extract_model_scores.py --eval_model gpt-3.5-turbo-0301
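Step 2 turns free-text LLM responses into numeric scores. The sketch below illustrates one way this could work, assuming each response contains the word "score" followed by a rating from 1 to 5 (the SummEval rating scale). The response format, the regular expression, and the `extract_score` and `compile_scores` helpers are all assumptions for illustration and may differ from what `extract_model_scores.py` does.

```python
import re

# Assumed format: the word "score" followed shortly by a single digit 1-5.
SCORE_PATTERN = re.compile(r"score\D{0,10}?([1-5])(?!\d)", re.IGNORECASE)


def extract_score(response):
    """Return the last 1-5 rating mentioned after 'score', or None if absent."""
    matches = SCORE_PATTERN.findall(response)
    return int(matches[-1]) if matches else None


def compile_scores(responses_by_dim):
    """Turn {dimension: [response, ...]} into one {dimension: score} dict per summary."""
    dims = list(responses_by_dim)
    n_summaries = len(responses_by_dim[dims[0]])
    return [
        {dim: extract_score(responses_by_dim[dim][i]) for dim in dims}
        for i in range(n_summaries)
    ]


# Hypothetical responses for two summaries and two dimensions.
responses = {
    "relevance": ["The summary covers the key facts. Score: 4", "Score: 2"],
    "fluency": ["Reads naturally. Score: 5", "unclear"],
}
print(compile_scores(responses))
# [{'relevance': 4, 'fluency': 5}, {'relevance': 2, 'fluency': None}]
```

Returning `None` for unparseable responses keeps failed extractions visible, so they can be filtered out before any correlation is computed.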
- Step 3.1: Calculate correlation for all 1200 summaries
python calc_data_corr.py --eval_model gpt-3.5-turbo-0301 --eval_type 0
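To make the correlation step concrete, here is a dependency-free sketch of Pearson and Spearman correlation between human and LLM scores. It is only an illustration: which coefficients `calc_data_corr.py` reports is an assumption, and in practice `scipy.stats.pearsonr`, `scipy.stats.spearmanr`, and `scipy.stats.kendalltau` provide the same statistics. The toy scores are made up.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up human and LLM scores for five summaries.
human = [1, 2, 3, 4, 5]
llm = [2, 1, 4, 3, 5]
print(round(spearman(human, llm), 3))  # 0.8
```

Both functions divide by zero when one list is constant, which can happen if an evaluator assigns the same score to every summary; a real script would need to handle that case.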
- Step 3.2: Calculate correlation for each candidate model
python per_model_corr.py --eval_model gpt-3.5-turbo-0301 --eval_type 0
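Per-model correlation restricts the comparison to the summaries produced by one candidate system at a time. The sketch below groups toy records by model and computes Kendall's tau-b for each group. The record layout `(model, human_score, llm_score)`, the model names, and the choice of Kendall's tau are assumptions for illustration, not a description of `per_model_corr.py`.

```python
from collections import defaultdict
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: pairwise agreement in ordering, adjusted for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: counted in neither term
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def per_model_correlation(records):
    """Group (model, human, llm) records by model and correlate each group."""
    by_model = defaultdict(lambda: ([], []))
    for model, human, llm in records:
        by_model[model][0].append(human)
        by_model[model][1].append(llm)
    return {m: kendall_tau_b(h, l) for m, (h, l) in by_model.items()}


# Made-up records for two hypothetical candidate models.
records = [
    ("M1", 1, 1), ("M1", 2, 2), ("M1", 3, 3),
    ("M2", 1, 1), ("M2", 2, 3), ("M2", 3, 2),
]
print(per_model_correlation(records))
```

Returning `nan` when every pair is tied avoids a division by zero for a model whose summaries all received the same score.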
- Step 3.3: Calculate meta-correlation
python calc_meta_corr.py --eval_model gpt-3.5-turbo-0301
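As a rough illustration of meta-correlation, the sketch below assumes it is the correlation between each candidate model's per-model correlation and that model's average human score, so a negative value would mean the evaluator agrees less with humans on higher-quality systems. This reading, the use of Pearson correlation, and all names and numbers below are assumptions and may not match `calc_meta_corr.py`.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def meta_correlation(per_model_corr, avg_human_score):
    """Correlate per-model correlations with per-model average human scores."""
    models = sorted(per_model_corr)  # fixed order so the two lists align
    quality = [avg_human_score[m] for m in models]
    agreement = [per_model_corr[m] for m in models]
    return pearson(quality, agreement)


# Made-up values for three hypothetical candidate models.
per_model_corr = {"M1": 0.6, "M2": 0.4, "M3": 0.2}
avg_human_score = {"M1": 3.0, "M2": 4.0, "M3": 5.0}
print(round(meta_correlation(per_model_corr, avg_human_score), 3))  # -1.0
```

In this toy data the agreement with humans drops steadily as the average human score rises, which gives the most negative possible value.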