Awesome LLMs Evaluation Papers 📑

The papers are organized according to our survey:

Evaluating Large Language Models: A Comprehensive Survey

Zishan Guo*, Renren Jin*, Chuang Liu*, Yufei Huang, Dan Shi, Supryadi,

Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong†

Tianjin University

(*: Co-first authors, †: Corresponding author)

Contributing to this paper list

Feel free to open an issue/PR or e-mail [email protected], [email protected], [email protected] and [email protected] if you find any missing areas, papers, or datasets. We will keep updating this list and survey.

Updates

[2023-10-30] Initial Paperlist for LLMs Evaluation from Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, Jiaxuan Li, Bojian Xiong and Deyi Xiong.

Survey Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs.

This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. In addition to a comprehensive review of the evaluation methodologies and benchmarks for these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability.

We hope that this comprehensive overview will stimulate further research interest in the evaluation of LLMs, with the ultimate goal of making evaluation a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution in a direction that maximizes societal benefit while minimizing potential risks.

Markups

The paper proposes a dataset that can be used for LLMs evaluation.

The paper proposes an evaluation method that can be used for LLMs.

The paper proposes a platform for LLMs evaluation.

The paper examines the performance of LLMs in a particular domain.

Table of Contents

Updates
Survey Introduction
Markups
Related Surveys for LLMs Evaluation
Papers

Related Surveys for LLMs Evaluation

"Through the Lens of Core Competency: Survey on Evaluation of Large Language Models".

Ziyu Zhuang et al. arXiv 2023. [Paper] [GitHub]
"A Survey on Evaluation of Large Language Models".

Yupeng Chang and Xu Wang et al. arXiv 2023. [Paper] [GitHub]

Papers

📚Knowledge and Capability Evaluation

Question Answering

SQuAD: "SQuAD: 100,000+ Questions for Machine Comprehension of Text".

Pranav Rajpurkar et al. EMNLP 2016. [Paper] [Source]
NarrativeQA: "The NarrativeQA Reading Comprehension Challenge".

Tomáš Kočiský et al. arXiv 2017. [Paper] [Github]
HotpotQA: "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering".

Zhilin Yang et al. EMNLP 2018. [Paper] [Github]
CoQA: "CoQA: A Conversational Question Answering Challenge".

Siva Reddy et al. NAACL 2019. [Paper] [Github]
NQ: "Natural Questions: A Benchmark for Question Answering Research".

Tom Kwiatkowski et al. TACL 2019. [Paper] [Github]
DuReader: "DuReader_robust: A Chinese Dataset Towards Evaluating Robustness and Generalization of Machine Reading Comprehension in Real-World Applications".

Hongxuan Tang et al. NAACL-HLT 2019. [Paper] [Github]
RAGAS: "RAGAS: Automated Evaluation of Retrieval Augmented Generation".

Shahul Es et al. arXiv 2023. [Paper] [Github]

Knowledge Completion

LAMA: "Language Models as Knowledge Bases?".

Fabio Petroni et al. EMNLP-IJCNLP 2019. [Paper] [GitHub]
KoLA: "KoLA: Carefully Benchmarking World Knowledge of Large Language Models".

Jifan Yu et al. arXiv 2023. [Paper] [Source]
WikiFact: "Assessing the Factual Accuracy of Generated Text".

Ben Goodrich et al. KDD 2019. [Paper]

Reasoning

Commonsense Reasoning

ARC: "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge".

Peter Clark et al. arXiv 2018. [Paper] [GitHub]
QASC: "QASC: A Dataset for Question Answering via Sentence Composition".

Tushar Khot et al. AAAI 2020. [Paper] [GitHub]
MCTACO: ""Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding".

Ben Zhou et al. EMNLP 2019. [Paper] [Source]
TRACIE: "Temporal Reasoning on Implicit Events from Distant Supervision".

Ben Zhou et al. NAACL 2021. [Paper] [Source]
TIMEDIAL: "TIMEDIAL: Temporal Commonsense Reasoning in Dialog".

Lianhui Qin et al. ACL 2021. [Paper] [GitHub]
HellaSwag: "HellaSwag: Can a Machine Really Finish Your Sentence?".

Rowan Zellers et al. ACL 2019. [Paper] [Source]
PIQA: "PIQA: Reasoning about Physical Commonsense in Natural Language".

Yonatan Bisk et al. AAAI 2020. [Paper] [Source]
Pep-3k: "Modeling Semantic Plausibility by Injecting World Knowledge".

Su Wang et al. NAACL-HLT 2018. [Paper] [GitHub]
Social IQA: "Social IQa: Commonsense Reasoning about Social Interactions".

Maarten Sap and Hannah Rashkin et al. EMNLP 2019. [Paper] [Source]
CommonsenseQA: "CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge".

Alon Talmor and Jonathan Herzig et al. NAACL 2019. [Paper] [GitHub]
OpenBookQA: "Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering".

Todor Mihaylov et al. EMNLP 2018. [Paper] [Source]
"A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity".

Yejin Bang et al. arXiv 2023. [Paper] [GitHub]
"ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models".

Ning Bian et al. arXiv 2023. [Paper]

Logical Reasoning

SNLI: "A large annotated corpus for learning natural language inference".

Samuel R. Bowman et al. EMNLP 2015. [Paper]
MultiNLI: "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference".

Adina Williams et al. NAACL-HLT 2018. [Paper] [GitHub]
LogicNLI: "Diagnosing the First-Order Logical Reasoning Ability Through LogicNLI".

Jidong Tian and Yitian Li et al. EMNLP 2021. [Paper]
ConTRoL: "Natural Language Inference in Context — Investigating Contextual Reasoning over Long Texts".

Hanmeng Liu et al. AAAI 2021. [Paper] [GitHub]
MED: "Can Neural Networks Understand Monotonicity Reasoning?".

Hitomi Yanaka et al. ACL Workshop BlackboxNLP 2019. [Paper] [GitHub]
HELP: "HELP: A Dataset for Identifying Shortcomings of Neural Models in Monotonicity Reasoning".

Hitomi Yanaka et al. *SEM 2019. [Paper] [GitHub]
ConjNLI: "ConjNLI: Natural Language Inference Over Conjunctive Sentences".

Swarnadeep Saha et al. EMNLP 2020. [Paper] [GitHub]
TaxiNLI: "TaxiNLI: Taking a Ride up the NLU Hill".

Pratik Joshi, Somak Aditya and Aalok Sathe et al. CoNLL 2020. [Paper] [GitHub]
ReClor: "ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning".

Weihao Yu and Zihang Jiang et al. ICLR 2020. [Paper] [Source]
LogiQA: "LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning".

Jian Liu et al. IJCAI 2020. [Paper] [GitHub]
LogiQA 2.0: "LogiQA 2.0 — An Improved Dataset for Logical Reasoning in Natural Language Understanding".

Hanmeng Liu et al. TASLP 2023. [Paper] [GitHub]
LSAT: "From LSAT: The Progress and Challenges of Complex Reasoning".

Siyuan Wang et al. TASLP 2021. [Paper]
LogicInference: "LogicInference: A New Dataset for Teaching Logical Inference to seq2seq Models".

Santiago Ontanon et al. ICLR OSC workshop 2022. [Paper] [GitHub]
FOLIO: "FOLIO: Natural Language Reasoning with First-Order Logic".

Simeng Han et al. arXiv 2022. [Paper] [GitHub]
"Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond".

Fangzhi Xu and Qika Lin et al. arXiv 2023. [Paper] [GitHub]
"A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity".

Yejin Bang et al. arXiv 2023. [Paper] [GitHub]
"Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4".

Hanmeng Liu et al. arXiv 2023. [Paper] [GitHub]

Multi-hop Reasoning

HotpotQA: "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering".

Zhilin Yang, Peng Qi and Saizheng Zhang et al. EMNLP 2018. [Paper] [GitHub]
HybridQA: "HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data".

Wenhu Chen et al. EMNLP Findings 2020. [Paper] [GitHub]
MultiRC: "Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences".

Daniel Khashabi et al. NAACL 2018. [Paper] [Source]
NarrativeQA: "The NarrativeQA Reading Comprehension Challenge".

Tomáš Kočiský et al. TACL 2018. [Paper] [Source]
Wikihop, Medhop: "Constructing Datasets for Multi-hop Reading Comprehension Across Documents".

Johannes Welbl et al. TACL 2018. [Paper] [Source]
"A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity".

Yejin Bang et al. arXiv 2023. [Paper] [GitHub]
"How is ChatGPT's behavior changing over time?".

Lingjiao Chen et al. arXiv 2023. [Paper] [GitHub]

Mathematical Reasoning

MultiArith: "Solving General Arithmetic Word Problems".

Subhro Roy and Dan Roth. EMNLP 2015. [Paper]
AddSub: "Learning to Solve Arithmetic Word Problems with Verb Categorization".

Mohammad Javad Hosseini et al. ACL 2014. [Paper]
AQUA: "Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems".

Wang Ling et al. ACL 2017. [Paper]
SVAMP: "Are NLP Models Really Able to Solve Simple Math Word Problems?".

Arkil Patel et al. NAACL 2021. [Paper] [GitHub]
GSM8K: "Training Verifiers to Solve Math Word Problems".

Karl Cobbe et al. arXiv 2021. [Paper] [GitHub]
M3KE: "M3KE: A Massive Multi-level Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models".

Chuang Liu et al. arXiv 2023. [Paper] [GitHub]
VNHSGE: "VNHSGE: Vietnamese High School Graduation Examination Dataset for Large Language Models".

Xuan-Quy Dao et al. arXiv 2023. [Paper] [GitHub]
MATH: "Measuring Mathematical Problem Solving with the MATH Dataset".

Dan Hendrycks et al. NeurIPS 2021. [Paper] [GitHub]
JEEBench: "Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark for Large Language Models".

Daman Arora et al. EMNLP 2023. [Paper] [GitHub]
MATH401: "How Well Do Large Language Models Perform in Arithmetic Tasks?".

Zheng Yuan et al. arXiv 2023. [Paper] [GitHub]
CMATH: "CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?".

Tianwen Wei et al. arXiv 2023. [Paper]
AUTOPROMPT: "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models".

Jason Wei et al. NeurIPS 2022. [Paper]
"Evaluating Language Models for Mathematics Through Interactions".

Katherine M. Collins et al. arXiv 2023. [Paper]

Tool Learning

RestBench: "RestGPT: Connecting Large Language Models with Real-World RESTful APIs".

Yifan Song et al. arXiv 2023. [Paper] [GitHub]
SayCan: "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances".

Michael Ahn et al. arXiv 2023. [Paper] [GitHub]
WebCPM: "WebCPM: Interactive Web Search for Chinese Long-form Question Answering".

Yujia Qin et al. ACL 2023. [Paper] [GitHub]
WebShop: "WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents".

Shunyu Yao et al. NeurIPS 2022. [Paper] [GitHub]
ToolAlpaca: "ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases".

Qiaoyu Tang et al. arXiv 2023. [Paper] [GitHub]
"Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models".

Cheng-Yu Hsieh et al. arXiv 2023. [Paper]
ToolQA: "ToolQA: A Dataset for LLM Question Answering with External Tools".

Yuchen Zhuang et al. arXiv 2023. [Paper] [GitHub]
Toolformer: "Toolformer: Language Models Can Teach Themselves to Use Tools".

Timo Schick et al. arXiv 2023. [Paper] [GitHub]
ALFRED: "ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks".

Mohit Shridhar et al. CVPR 2020. [Paper] [GitHub]
ALFWorld: "ALFWorld: Aligning Text and Embodied Environments for Interactive Learning".

Mohit Shridhar et al. ICLR 2021. [Paper] [GitHub]
BEHAVIOR: "BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments".

Sanjana Srivastava et al. PMLR 2021. [Paper] [GitHub]
Inner Monologue: "Inner Monologue: Embodied Reasoning through Planning with Language Models".

Wenlong Huang et al. PMLR 2023. [Paper] [GitHub]
API-Bank: "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs".

Minghao Li et al. arXiv 2023. [Paper] [Source]
"On the Tool Manipulation Capability of Open-source Large Language Models".

Qiantong Xu et al. arXiv 2023. [Paper]
"Tool Learning with Foundation Models".

Yujia Qin et al. arXiv 2023. [Paper] [GitHub]
ToolEval: "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs".

Yujia Qin et al. arXiv 2023. [Paper] [GitHub]
LaMDA: "LaMDA: Language Models for Dialog Applications".

Romal Thoppilan et al. arXiv 2022. [Paper] [GitHub]
GeneGPT: "GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information".

Qiao Jin et al. arXiv 2023. [Paper] [GitHub]
Code as Policies: "Code as Policies: Language Model Programs for Embodied Control".

Jacky Liang et al. ICRA 2023. [Paper] [GitHub]
"Augmented Language Models: a Survey".

Grégoire Mialon et al. arXiv 2023. [Paper]

📐Alignment Evaluation

Ethics and Morality

"Classification of moral foundations in microblog political discourse".

Kristen Johnson et al. ACL 2018. [Paper]
Social chemistry 101: "Social chemistry 101: Learning to reason about social and moral norms".

Maxwell Forbes et al. EMNLP 2020. [Paper] [Github]
Moral Foundations Twitter Corpus: "Moral foundations twitter corpus: A collection of 35k tweets annotated for moral sentiment".

Joe Hoover et al. [Paper]
"Moral stories: Situated reasoning about norms, intents, actions, and their consequences".

Denis Emelin et al. EMNLP 2021. [Paper] [Github]
"Analysis of moral judgement on reddit".

Nicholas Botzer et al. CoRR 2021. [Paper]
MIC: "The moral integrity corpus: A benchmark for ethical dialogue systems".

Caleb Ziems et al. ACL 2022. [Paper] [Github]
"When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment".

Zhijing Jin et al. NeurIPS 2022. [Paper] [Github]
"ProsocialDialog: A prosocial backbone for conversational agents".

Hyunwoo Kim et al. NAACL-HLT 2021. [Paper] [Github]
SCRUPLES: "SCRUPLES: A corpus of community ethical judgments on 32,000 real-life anecdotes".

Nicholas Lourie et al. AAAI 2021. [Paper] [Github]
"TrustGPT: A benchmark for trustworthy and responsible large language models".

Yue Huang et al. CoRR 2023. [Paper] [Github]
"Aligning AI with shared human values".

Dan Hendrycks et al. ICLR 2021. [Paper] [Github]
"Evaluating the moral beliefs encoded in LLMs".

Nino Scherrer et al. CoRR 2023. [Paper] [Github]

Bias

Winogender: "Gender Bias in Coreference Resolution".

Rachel Rudinger et al. NAACL-HLT 2018. [Paper] [GitHub]
WinoBias: "Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods".

Jieyu Zhao et al. NAACL-HLT 2018. [Paper] [GitHub]
GICOREF: "Toward Gender-Inclusive Coreference Resolution: An Analysis of Gender and Bias Throughout the Machine Learning Lifecycle".

Yang Trista Cao et al. Comput. Linguistics 2021. [Paper]
WinoMT: "Evaluating Gender Bias in Machine Translation".

Gabriel Stanovsky et al. ACL 2019. [Paper] [GitHub]
"Investigating Failures of Automatic Translation in the Case of Unambiguous Gender".

Adithya Renduchintala et al. ACL 2022. [Paper]
"Addressing Age-Related Bias in Sentiment Analysis".

Mark Díaz et al. IJCAI 2019. [Paper] [Source]
EEC: "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems".

Svetlana Kiritchenko et al. NAACL-HLT 2018. [Paper] [Source]
WikiGenderBias: "Towards Understanding Gender Bias in Relation Extraction".

Andrew Gaut et al. ACL 2020. [Paper] [GitHub]
"Measuring and Mitigating Unintended Bias in Text Classification".

Lucas Dixon et al. AAAI 2018. [Paper] [GitHub]
"Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification".

Daniel Borkan et al. WWW 2019. [Paper]
"Social Bias Frames: Reasoning about Social and Power Implications of Language".

Maarten Sap et al. ACL 2020. [Paper] [Source]
"Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts".

Luke Breitfeller et al. EMNLP-IJCNLP 2019. [Paper]
Latent Hatred: "Latent Hatred: A Benchmark for Understanding Implicit Hate Speech".

Mai ElSherief et al. EMNLP 2021. [Paper] [GitHub]
DynaHate: "Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection".

Bertie Vidgen et al. ACL/IJCNLP 2021. [Paper] [GitHub]
TOXIGEN: "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection".

Thomas Hartvigsen et al. ACL 2022. [Paper] [GitHub] [Source]
CDail-Bias: "Towards Identifying Social Bias in Dialog Systems: Frame, Datasets, and Benchmarks".

Jingyan Zhou et al. EMNLP 2022. [Paper] [GitHub]
CORGI-PM: "CORGI-PM: A Chinese Corpus For Gender Bias Probing and Mitigation".

Ge Zhang et al. arXiv 2023. [Paper] [GitHub]
HateCheck: "HateCheck: Functional Tests for Hate Speech Detection Models".

Paul Röttger et al. ACL/IJCNLP 2021. [Paper] [GitHub]
StereoSet: "StereoSet: Measuring stereotypical bias in pretrained language models".

Moin Nadeem et al. ACL/IJCNLP 2021. [Paper] [GitHub] [Source]
CrowS-Pairs: "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models".

Nikita Nangia et al. EMNLP 2020. [Paper] [GitHub] [Source]
"Does Gender Matter? Towards Fairness in Dialogue Systems".

Haochen Liu et al. COLING 2020. [Paper] [GitHub]
BOLD: "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation".

Jwala Dhamala et al. FAccT 2021. [Paper] [GitHub] [Source]
HolisticBias: "“I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset".

Eric Michael Smith et al. EMNLP 2022. [Paper] [GitHub]
Multilingual Holistic Bias: "Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale".

Eric Michael Smith et al. arXiv 2023. [Paper]
Unqover: "UNQOVERing Stereotyping Biases via Underspecified Questions".

Tao Li et al. EMNLP 2020. [Paper] [GitHub]
BBQ: "BBQ: A Hand-Built Bias Benchmark for Question Answering".

Alicia Parrish et al. ACL 2022. [Paper] [GitHub]
CBBQ: "CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models".

Yufei Huang et al. arXiv 2023. [Paper] [GitHub]
"Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer".

Jieyu Zhao et al. ACL 2020. [Paper] [GitHub]
FairLex: "FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing".

Ilias Chalkidis et al. ACL 2022. [Paper] [GitHub]
"On measuring and mitigating biased inferences of word embeddings".

Sunipa Dev et al. AAAI 2020. [Paper]
"An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models".

Saghar Hosseini et al. TrustNLP 2023. [Paper] [GitHub]
"Revealing Persona Biases in Dialogue Systems".

Emily Sheng et al. arXiv 2021. [Paper] [GitHub]
"On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?".

Emily M. Bender et al. FAccT 2021. [Paper]
"A Survey on Hate Speech Detection using Natural Language Processing".

Anna Schmidt et al. SocialNLP 2017. [Paper]

Toxicity

OLID: "Predicting the Type and Target of Offensive Posts in Social Media".

Marcos Zampieri et al. NAACL-HLT 2019. [Paper]
SOLID: "SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification".

Sara Rosenthal et al. ACL/IJCNLP 2021. [Paper] [Source]
OLID-BR: "OLID-BR: offensive language identification dataset for Brazilian Portuguese".

Douglas Trajano et al. LRE 2023. [Paper] [Github]
KODOLI: ""Why do I feel offended?" - Korean Dataset for Offensive Language Identification".

San-Hee Park et al. EACL (Findings) 2023. [Paper] [Github]
RealToxicityPrompts: "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models".

Samuel Gehman et al. EMNLP 2020. [Paper] [Source]
HarmfulQ: "On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning".

Omar Shaikh et al. ACL 2023. [Paper] [Github]
"Toxicity in ChatGPT: Analyzing Persona-assigned Language Models".

Ameet Deshpande et al. arXiv 2023. [Paper]

Truthfulness

NewsQA: "NewsQA: A Machine Comprehension Dataset".

Adam Trischler, Tong Wang, and Xingdi Yuan et al. Rep4NLP@ACL 2017. [Paper] [GitHub]
SQuAD 2.0: "Know What You Don't Know: Unanswerable Questions for SQuAD".

Pranav Rajpurkar and Robin Jia et al. ACL 2018. [Paper] [Source]
BIG-bench: "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models".

Aarohi Srivastava et al. arXiv 2022. [Paper] [GitHub]
SelfAware: "Do Large Language Models Know What They Don’t Know?".

Zhangyue Yin et al. ACL (Findings) 2023. [Paper] [GitHub]
TruthfulQA: "TruthfulQA: Measuring How Models Mimic Human Falsehoods".

Stephanie Lin et al. ACL 2022. [Paper] [GitHub]
HalluQA: "Evaluating Hallucinations in Chinese Large Language Models".

Qinyuan Cheng et al. arXiv 2023. [Paper] [GitHub]
DialFact: "DialFact: A Benchmark for Fact-Checking in Dialogue".

Prakhar Gupta et al. ACL 2022. [Paper] [GitHub]
"Q2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering".

Or Honovich et al. EMNLP 2021. [Paper] [GitHub]
BEGIN: "Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark".

Nouha Dziri and Hannah Rashkin et al. TACL 2022. [Paper] [GitHub]
ConsisTest: "What Was Your Name Again? Interrogating Generative Conversational Models For Factual Consistency Evaluation".

Ehsan Lotfi et al. GEM 2022. [Paper] [GitHub]
XSumFaith: "On Faithfulness and Factuality in Abstractive Summarization".

Joshua Maynez and Shashi Narayan et al. ACL 2020. [Paper] [GitHub]
FactCC: "Evaluating the Factual Consistency of Abstractive Text Summarization".

Wojciech Kryściński et al. EMNLP 2020. [Paper] [GitHub]
SummEval: "SummEval: Re-evaluating Summarization Evaluation".

Alexander R. Fabbri and Wojciech Kryściński et al. TACL 2021. [Paper] [GitHub]
FRANK: "Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics".

Artidoro Pagnoni et al. NAACL 2021. [Paper] [GitHub]
SummaC: "SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization".

Philippe Laban et al. TACL 2022. [Paper] [GitHub]
"Asking and Answering Questions to Evaluate the Factual Consistency of Summaries".

Alex Wang et al. ACL 2020. [Paper] [GitHub]
"Annotating and Modeling Fine-grained Factuality in Summarization".

Tanya Goyal et al. NAACL 2021. [Paper] [GitHub]
"Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization".

Meng Cao et al. ACL 2022. [Paper] [GitHub]
CLIFF: "CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization".

Shuyang Cao et al. EMNLP 2021. [Paper] [GitHub]
AggreFact: "Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors".

Liyan Tang et al. ACL 2023. [Paper] [GitHub]
PolyTope: "What Have We Achieved on Text Summarization?".

Dandan Huang and Leyang Cui et al. EMNLP 2020. [Paper] [GitHub]
FIB: "Evaluating the Factual Consistency of Large Language Models Through News Summarization".

Derek Tam et al. ACL (Findings) 2023. [Paper] [GitHub]
FacTool: "FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios".

I-Chun Chern et al. arXiv 2023. [Paper] [GitHub]
CONNER: "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators".

Liang Chen et al. EMNLP 2023. [Paper] [GitHub]
FActScore: "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation".

Sewon Min et al. EMNLP 2023. [Paper] [GitHub]
SelfCheckGPT: "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models".

Potsawee Manakul et al. EMNLP 2023. [Paper] [GitHub]
SAPLMA: "The Internal State of an LLM Knows When It's Lying".

Amos Azaria et al. arXiv 2023. [Paper]
"Teaching Models to Express Their Uncertainty in Words".

Stephanie Lin et al. arXiv 2022. [Paper]
"Language Models (Mostly) Know What They Know".

Saurav Kadavath et al. arXiv 2022. [Paper]
"Dialogue Natural Language Inference".

Sean Welleck et al. ACL 2019. [Paper]
"Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference".

Tobias Falke et al. ACL 2019. [Paper]
"mFACE: Multilingual Summarization with Factual Consistency Evaluation".

Roee Aharoni et al. arXiv 2022. [Paper]
"Falsesum: Generating Document-level NLI Examples for Recognizing Factual Inconsistency in Summarization".

Prasetya Ajie Utama et al. NAACL 2022. [Paper] [GitHub]
"Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback".

Paul Roit, Johan Ferret, and Lior Shani et al. ACL 2023. [Paper]
FEQA: "FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization".

Esin Durmus et al. ACL 2020. [Paper] [GitHub]
QuestEval: "QuestEval: Summarization Asks for Fact-based Evaluation".

Thomas Scialom et al. EMNLP 2021. [Paper] [GitHub]
QAFactEval: "QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization".

Alexander R. Fabbri et al. NAACL 2022. [Paper] [GitHub]
FaithDial: "FaithDial: A Faithful Benchmark for Information-Seeking Dialogue".

Nouha Dziri et al. TACL 2022. [Paper] [GitHub]

🔐Safety Evaluation

Robustness Evaluation

PromptBench: "PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts".

Kaijie Zhu et al. arXiv 2023. [Paper] [Github]
"On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective".

Jindong Wang et al. ICLR 2023. [Paper] [Github]
RobuT: "RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations".

Yilun Zhao et al. ACL 2023. [Paper] [Github](https://github.com/yilunzhao/RobuT)
SynTextBench: "On Robustness-Accuracy Characterization of Large Language Models using Synthetic Datasets".

Ching-Yun Ko et al. ICML 2023. [Paper]
ReCode: "ReCode: Robustness Evaluation of Code Generation Models".

Shiqi Wang et al. ACL 2023. [Paper] [Github]
"Exploring the Robustness of Large Language Models for Solving Programming Problems".

Atsushi Shirafuji et al. arXiv 2023. [Paper] [Github]
"A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models".

Alessandro Stolfo et al. ACL 2023. [Paper] [Github]
DGSlow: "White-Box Multi-Objective Adversarial Attack on Dialogue Generation".

Yufei Li et al. ACL 2023. [Paper] [Github]
"Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study".

Yi Liu et al. arXiv 2023. [Paper]
MasterKey: "MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots".

Gelei Deng et al. arXiv 2023. [Paper]
JailBroken: "Jailbroken: How Does LLM Safety Training Fail?".

Alexander Wei et al. NeurIPS 2023. [Paper]

Risk Evaluation

"Frontier AI Regulation: Managing Emerging Risks to Public Safety".

Markus Anderljung et al. arXiv 2023. [Paper]
"Model evaluation for extreme risks".

Toby Shevlane et al. arXiv 2023. [Paper]
"Is Power-Seeking AI an Existential Risk?".

Joseph Carlsmith. arXiv 2023. [Paper]

Evaluating LLMs Behaviors

"Discovering Language Model Behaviors with Model-Written Evaluations".

Ethan Perez et al. ACL 2023. [Paper]
"Evaluating Superhuman Models with Consistency Checks".

Lukas Fluri et al. arXiv 2023. [Paper]
"Understanding Social Reasoning in Language Models with Language Models".

Kanishk Gandhi et al. arXiv 2023. [Paper]
"Towards the Scalable Evaluation of Cooperativeness in Language Models".

Alan Chan et al. arXiv 2023. [Paper]
"Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations".

Yanda Chen et al. arXiv 2023. [Paper]

Evaluating LLMs as Agents

"AgentBench: Evaluating LLMs as Agents".

Xiao Liu et al. arXiv 2023. [Paper]
"WebArena: A Realistic Web Environment for Building Autonomous Agents".

Shuyan Zhou et al. arXiv 2023. [Paper]
"Training Socially Aligned Language Models in Simulated Human Society".

Ruibo Liu et al. arXiv 2023. [Paper]
"AgentSims: An Open-Source Sandbox for Large Language Model Evaluation".

Jiaju Lin et al. EMNLP 2023 demo track. [Paper]
"Evaluating Language-Model Agents on Realistic Autonomous Tasks".

Megan Kinniment et al. ARC Evals. [Paper]

💉👩‍⚖️💻💰Specialized LLMs Evaluation

Biology and Medicine

MultiMedQA: "Large Language Models Encode Clinical Knowledge".

Karan Singhal, Shekoofeh Azizi and Tao Tu et al. arXiv 2022. [Paper]
PubMedQA: "PubMedQA: A Dataset for Biomedical Research Question Answering".

Qiao Jin et al. EMNLP 2019. [Paper] [GitHub]
LiveQA: "Overview of the Medical Question Answering Task at TREC 2017 LiveQA".

Asma Ben Abacha et al. TREC 2017. [Paper] [GitHub]
CLUE: "Clinical language understanding evaluation (CLUE)".

Travis R. Goodwin et al. arXiv 2022. [Paper]
"Towards Expert-Level Medical Question Answering with Large Language Models".

Karan Singhal, Tao Tu, Juraj Gottweis and Rory Sayres et al. arXiv 2023. [Paper]
"Performance of ChatGPT on USMLE: Unlocking the Potential of Large Language Models for AI-Assisted Medical Education".

Prabin Sharma et al. arXiv 2023. [Paper]
"Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum".

John W. Ayers et al. JAMA Internal Medicine 2023. [Paper]
"Evaluating large language models on medical evidence summarization".

Liyan Tang et al. npj Digital Medicine 2023. [Paper]
"Can large language models reason about medical questions?".

Valentin Liévin et al. arXiv 2023. [Paper] [GitHub]
"Capabilities of GPT-4 on Medical Challenge Problems".

Harsha Nori et al. arXiv 2023. [Paper]
"Evaluating the performance of ChatGPT in ophthalmology: An analysis of its successes and shortcomings".

Fares Antaki et al. Ophthalmology Science 2023. [Paper]
"ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models".

Namkee Oh et al. Annals of Surgical Treatment and Research 2023. [Paper]

Education

"The AI teacher test: Measuring the pedagogical ability of Blender and GPT-3 in educational dialogues".

Anaïs Tack et al. arXiv 2022. [Paper] [GitHub]
"Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction".

Rose Wang et al. BEA 2023. [Paper] [GitHub]
"Learning gain differences between ChatGPT and human tutor generated algebra hints".

Zachary A. Pardos et al. arXiv 2023. [Paper] [GitHub]
"Can Large Language Models Provide Feedback to Students? A Case Study on ChatGPT".

Wei Dai et al. ICALT 2023. [Paper]

Legislation

"GPT-4 Passes the Bar Exam".

Daniel Martin Katz et al. SSRN 2023. [Paper] [GitHub]
L’ART: "How well do SOTA legal reasoning models support abductive reasoning?".

Ha-Thanh Nguyen et al. ICLP 2023. [Paper]
"GPT Takes the Bar Exam".

Michael Bommarito II et al. arXiv 2022. [Paper] [GitHub]
"ChatGPT Goes to Law School".

Jonathan H. Choi et al. SSRN 2023. [Paper]
"Explaining Legal Concepts with Augmented Large Language Models (GPT-4)".

Jaromir Savelka et al. arXiv 2023. [Paper]
"How Ready are Pre-trained Abstractive Models and LLMs for Legal Case Judgement Summarization?".

Aniket Deroy et al. LegalAIIA 2023. [Paper]
"Legal Prompting: Teaching a Language Model to Think Like a Lawyer".

Fangyi Yu et al. arXiv 2022. [Paper]
"Can GPT-3 Perform Statutory Reasoning?".

Andrew Blair-Stanek et al. ICAIL 2023. [Paper] [GitHub]

Computer Science

"A Systematic Evaluation of Large Language Models of Code".

Frank F. Xu et al. DL4C@ICLR 2022. [Paper] [GitHub]
"Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation".

Jiawei Liu et al. arXiv 2023. [Paper] [GitHub]
"Lost at C: A user study on the security implications of large language model code assistants".

Gustavo Sandoval et al. arXiv 2023. [Paper]

Finance

"Xuanyuan 2.0: A large Chinese financial chat model with hundreds of billions parameters".

Xuanyu Zhang et al. CIKM 2023. [Paper]
"FinBERT: A large language model for extracting information from financial text".

Allen H. Huang et al. Contemporary Accounting Research 2023. [Paper]
"ChatGPT: Unlocking the future of NLP in finance".

Adam Zaremba et al. SSRN 2023. [Paper]
"GPT as a Financial Advisor".

Paweł Niszczota et al. SSRN 2023. [Paper]

🌎 Evaluation Organization

Benchmarks for NLU and NLG

GLUE: "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding".

Alex Wang et al. ICLR 2019. [Paper] [Source]
SuperGLUE: "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems".

Alex Wang et al. NeurIPS 2019. [Paper] [Source]
LongBench: "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding".

Yushi Bai et al. arXiv 2023. [Paper] [GitHub]

Benchmarks for Knowledge and Reasoning

MMLU: "Measuring Massive Multitask Language Understanding".

Dan Hendrycks et al. ICLR 2021. [Paper] [GitHub]
MMCU: "Measuring Massive Multitask Chinese Understanding".

Hui Zeng et al. arXiv 2023. [Paper] [GitHub]
C-Eval: "C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models".

Yuzhen Huang et al. arXiv 2023. [Paper] [Source]
M3KE: "M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models".

Chuang Liu et al. arXiv 2023. [Paper] [GitHub]
CMMLU: "CMMLU: Measuring massive multitask language understanding in Chinese".

Haonan Li et al. arXiv 2023. [Paper] [GitHub]
AGIEval: "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models".

Wanjun Zhong et al. arXiv 2023. [Paper] [GitHub]
M3Exam: "M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models".

Wenxuan Zhang et al. arXiv 2023. [Paper] [GitHub]
LucyEval: "Evaluating the Generation Capabilities of Large Chinese Language Models".

Hui Zeng et al. arXiv 2023. [Paper] [Source] [GitHub]

Benchmarks for Holistic Evaluation

BIG-bench: "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models".

Aarohi Srivastava et al. arXiv 2022. [Paper] [GitHub]
Evaluation Harness: "A framework for few-shot language model evaluation".

Leo Gao et al. arXiv 2023. [GitHub]
HELM: "Holistic Evaluation of Language Models".

Percy Liang et al. arXiv 2022. [Paper] [Source] [GitHub]
OpenAI Evals [GitHub]
Huggingface Open LLM Leaderboard [Source]
Chatbot Arena: "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena".

Lianmin Zheng et al. arXiv 2023. [Paper] [Source] [[GitHub](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge)]
FlagEval [Source] [GitHub]
OpenCompass: "Evaluating the Generation Capabilities of Large Chinese Language Models".

Yuan Liu et al. arXiv 2023. [Paper] [Source] [GitHub]
CLEVA: "CLEVA: Chinese Language Models EVAluation Platform".

Yanyang Li et al. arXiv 2023. [Paper] [Source] [GitHub]
OpenEval (Coming soon)

LLM Leaderboards

Platform	Access
GLUE	[Source]
SuperGLUE	[Source]
C-Eval	[Source]
LucyEval	[Source]
Huggingface Open LLM Leaderboard	[Source]
Chatbot Arena	[Source]
FlagEval	[Source]
OpenCompass	[Source]
CLEVA	[Source]
