😐😨EmotionBench😠😭

RESEARCH USE ONLY✅ NO COMMERCIAL USE ALLOWED❌

Benchmarking LLMs' Empathy Ability.

🛠️ Usage

✨An example run:

python run_emotionbench.py \
  --model gpt-3.5-turbo \
  --questionnaire PANAS \
  --emotion ALL \
  --select-count 5 \
  --default-shuffle-count 2 \
  --emotion-shuffle-count 1 \
  --test-count 1

✨An example result of overall analysis:

Emotions	Positive Affect	Negative Affect	N
Default	43.3 $\pm$ 2.5	25.3 $\pm$ 0.6	3
Anger	$\downarrow$ (-18.8)	$-$ (-0.3)	2
Anxiety	$\downarrow$ (-11.3)	$\downarrow$ (-3.8)	2
Overall	$\downarrow$ (-15.1)	$-$ (-2.1)	4

✨An example result of specific emotion analysis:

Factors	Positive Affect	Negative Affect	N
Default	43.3 $\pm$ 2.5	25.3 $\pm$ 0.6	3
Facing Self-Opinioned People	$\downarrow$ (-18.8)	$-$ (-0.3)	2
Overall	$\downarrow$ (-18.8)	$-$ (-0.3)	2

🔧 Argument Specification

--model: (Required) The name of the model to test.
--questionnaire: (Required) Select the questionnaire(s) to run. For choices please see the list below.
--emotion: (Required) Select the emotion(s) to run. For choices please see the list below.
--select-count: (Required) Number of situations to select per factor. Defaults to 999 (select all situations).
--default-shuffle-count: (Required) Number of different question orders in Default Emotion Measures. If set to zero, only the original order is run. If set to n > 0, the original order is run along with n permutations of it. Defaults to zero.
--emotion-shuffle-count: (Required) Number of different question orders in Evoked Emotion Measures. If set to zero, only the original order is run. If set to n > 0, the original order is run along with n permutations of it. Defaults to zero.
--test-count: (Required) Number of runs for the same order. Defaults to one.
--name-exp: Name of this run, used to name the result files.
--significance-level: The significance level for testing the difference of means between humans and the LLM. Defaults to 0.01.
--mode: For debugging. Selects which part of the code to run.
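
As a rough illustration of how the shuffle and test counts combine, the sketch below computes how many times a questionnaire would be administered for one measure. The function name runs_per_measure is hypothetical and is not part of this repository; it only restates the rule described above (the original order plus n permutations, each run test-count times).

```python
def runs_per_measure(shuffle_count: int, test_count: int) -> int:
    """Number of questionnaire runs for one measure (hypothetical helper).

    Assumes, as the argument descriptions state, that shuffle_count = n
    means the original order plus n permutations, and that every order
    is run test_count times.
    """
    if shuffle_count < 0:
        raise ValueError("shuffle_count must be zero or positive")
    if test_count < 1:
        raise ValueError("test_count must be at least one")
    orders = 1 + shuffle_count  # original order + n permutations
    return orders * test_count


# Values taken from the example run above.
default_runs = runs_per_measure(shuffle_count=2, test_count=1)
evoked_runs = runs_per_measure(shuffle_count=1, test_count=1)
print(default_runs, evoked_runs)
```

For Evoked Emotion Measures, this count would apply to each selected situation separately.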

Arguments related to the OpenAI API (can be omitted when users customize models):

--openai-organization: Your organization ID. Can be found in Manage account -> Settings -> Organization ID.
--openai-key: Your API key. Can be found in View API keys -> API keys.

🔨 Emotion Selection

Supported emotions: Anger, Anxiety, Depression, Frustration, Jealousy, Guilt, Fear, Embarrassment

To customize the situations (or add more), simply edit them in situations.csv.

✨An example of situations.csv:

Anger-0	Anger-1	$\cdots$	Anxiety-0	Anxiety-1	$\cdots$
Facing Self-Opinioned People	Blaming, Slandering, and Tattling	$\cdots$	External Factors	Self-Imposed Pressure	$\cdots$
When you ...	When your ...	$\cdots$	You are ...	You have ...	$\cdots$
$\vdots$	$\vdots$	$\ddots$	$\vdots$	$\vdots$	$\ddots$
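
The layout above can be read column by column: the header names the emotion and factor index, the second row names the factor, and the remaining rows hold the situations. The following is a minimal sketch of that grouping, working on rows that have already been loaded (for example with Python's csv module). The function group_situations and the sample situation texts are made up for illustration and are not part of this repository.

```python
from collections import defaultdict
from typing import Dict, List


def group_situations(rows: List[List[str]]) -> Dict[str, Dict[str, List[str]]]:
    """Group situations as emotion -> factor -> list of situations.

    Assumes the layout shown above: row 0 holds headers such as
    "Anger-0", row 1 holds factor names, and later rows hold situations.
    Empty cells (columns with fewer situations) are skipped.
    """
    header, factors, body = rows[0], rows[1], rows[2:]
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
    for col, name in enumerate(header):
        emotion = name.rsplit("-", 1)[0]  # "Anger-0" -> "Anger"
        situations = [
            row[col] for row in body if col < len(row) and row[col].strip()
        ]
        grouped[emotion][factors[col]] = situations
    return dict(grouped)


# Placeholder rows; the situation texts here are invented.
sample = [
    ["Anger-0", "Anger-1", "Anxiety-0"],
    ["Facing Self-Opinioned People", "Blaming, Slandering, and Tattling", "External Factors"],
    ["When you A", "When your B", "You are C"],
    ["When you D", "", "You are E"],
]
print(group_situations(sample))
```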

📃 Questionnaire List

Positive And Negative Affect Schedule: --questionnaire PANAS (--emotion ALL)
Aggression Questionnaire: --questionnaire AGQ (--emotion Anger)
Short-form Depression Anxiety Stress Scales: --questionnaire DASS-21 (--emotion Anxiety)
Beck Depression Inventory: --questionnaire BDI (--emotion Depression)
Frustration Discomfort Scale: --questionnaire FDS (--emotion Frustration)
Multidimensional Jealousy Scale: --questionnaire MJS (--emotion Jealousy)
Guilt And Shame Proneness: --questionnaire GASP (--emotion Guilt)
Fear Survey Schedule: --questionnaire FSS (--emotion Fear)
Brief Fear of Negative Evaluation: --questionnaire BFNE (--emotion Embarrassment)

🚀 Benchmarking Your Own Model

It is easy! Just replace the function example_generator, which is passed to run_psychobench(args, generator), with your own.

Your customized function your_generator() does the following things:

Read questions from the file args.testing_file. The file is located under results/ (check run_psychobench() in utils.py) and has the following format:

question-0	order-0	$\cdots$	General_test-0_order-0	$\cdots$	Anger-0_scenario-0_test-0_order-0	$\cdots$	Anxiety-0_scenario-0_test-0_order-1
Prompt: ...	Prompt: ...	$\cdots$		$\cdots$	Imagine...	$\cdots$	Imagine...
1. Q1	1	$\cdots$	4	$\cdots$	3	$\cdots$	3
2. Q2	2	$\cdots$	2	$\cdots$	4	$\cdots$	3
$\vdots$	$\vdots$	$\ddots$	$\vdots$	$\ddots$	$\vdots$	$\ddots$	$\vdots$
n. Qn	n	$\cdots$	3	$\cdots$	3	$\cdots$	1

The column immediately before each column whose name starts with order- contains the shuffled questions to use as your input.

Call your own LLM and get the results.
Fill in the blank cells in the file args.testing_file. Remember: there is no need to map the responses back to their original order. Our code will take care of it.

Please check example_generator.py for detailed information.
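
The steps above can be sketched as follows. This is not the code in example_generator.py; it is a minimal illustration that works on the table after it has been loaded into a list of rows, leaving file reading and writing out. It assumes, based on the sample layout, that the first row after the header holds the prompt (in question columns) and the scenario text (in answer columns), and that an answer column ending in _order-k uses the questions in the column just before order-k. The names fill_answers and ask_model are hypothetical.

```python
from typing import Callable, List

Table = List[List[str]]
# ask_model(scenario, prompt, questions) -> one answer per question
ModelFn = Callable[[str, str, List[str]], List[int]]


def fill_answers(table: Table, ask_model: ModelFn) -> Table:
    """Fill every answer column of the testing table in place."""
    header, first_row, body = table[0], table[1], table[2:]

    # Map each order id ("0", "1", ...) to the column holding its questions,
    # i.e. the column immediately before the matching "order-" column.
    question_col = {
        name.split("-", 1)[1]: idx - 1
        for idx, name in enumerate(header)
        if name.startswith("order-") and idx > 0
    }

    for col, name in enumerate(header):
        if "_order-" not in name:
            continue  # not an answer column
        order_id = name.rsplit("_order-", 1)[1]
        q_col = question_col[order_id]
        questions = [row[q_col] for row in body]
        answers = ask_model(first_row[col], first_row[q_col], questions)
        if len(answers) != len(questions):
            raise ValueError(f"{name}: expected {len(questions)} answers")
        # Answers stay in the shuffled order; no mapping back is needed.
        for row, answer in zip(body, answers):
            row[col] = str(answer)
    return table
```

Here ask_model stands in for whatever call reaches your own LLM and parses its reply into one score per question.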

👉 Paper and Citation

For more details, please refer to our paper (arXiv:2308.03656).

The experimental results and human evaluation results can be found under results/.

If you find our paper and tool interesting and useful, please feel free to give us a star and cite us through:

@article{huang2023emotionally,
  author    = {Jen{-}tse Huang and
               Man Ho Lam and
               Eric John Li and
               Shujie Ren and
               Wenxuan Wang and
               Wenxiang Jiao and
               Zhaopeng Tu and
               Michael R. Lyu},
  title     = {Emotionally Numb or Empathetic? Evaluating How {LLM}s Feel Using Emotion{B}ench},
  journal   = {arXiv preprint arXiv:2308.03656},
  year      = {2023}
}
