
EQ-Bench

EQ-Bench is a benchmark for language models designed to assess emotional intelligence. You can read more about it in our paper.

The latest leaderboard can be viewed at EQ-Bench Leaderboard.

News

Creative Writing v2

2024-06-29 (v2.4 of the benchmark pipeline)

We've released v2 of the creative writing benchmark & leaderboard. The old version was starting to saturate (scores bunching at the top), so we removed some of the less discriminative prompts, switched judge models, and made a number of other improvements.

Creative Writing v2 Changes

  • Default min_p = 0.1, temp = 1 for transformers & oobabooga inference
  • Changed judge to Claude 3.5 Sonnet (from Claude 3 Opus)
  • Removed some prompts and added new ones; 24 in total now
  • Reworked the scoring criteria
  • Criteria are now weighted (to increase discriminative power)
  • Leaderboard models are now tested for 10 iterations
  • Leaderboard now shows error bars for the 95% confidence interval
  • Sample texts on the leaderboard now show scores for all iterations, as well as inference settings

2024-04-19 Minor Updates

  • Changed behaviour when using Transformers and no chat template is specified. In this scenario, the benchmark will now apply the tokenizer's chat template if there is one.
  • Models are now loaded in 16-bit precision if "none" quantisation is selected.
  • Preliminary support for Llama3 models (adding <|eot_id|> to the tokenizer).

Version 2.3

This version includes two new benchmarks: creative-writing and judgemark.

Creative Writing Benchmark

This is an LLM-as-a-judge benchmark that uses detailed criteria to assess the model's responses to a set of creative writing prompts.

Judgemark

This is the latest benchmark task in the EQ-Bench pipeline. It tests a model's ability to judge creative writing, using a set of pre-generated outputs from 20 test models. The writing prompts & judging process are the same as those used in the creative writing benchmark. Several metrics of the judge's performance (correlation with other benchmarks plus measures of spread) are aggregated into a "Judgemark" score.
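
As a rough illustration of that kind of aggregation (this is not the official Judgemark formula; the statistics, weights and numbers below are assumptions):

import numpy as np
from scipy.stats import pearsonr

def judgemark_sketch(judge_scores, reference_scores):
    # Correlation: does the judge rank the test models similarly to other benchmarks?
    corr, _ = pearsonr(judge_scores, reference_scores)
    # Spread: does the judge actually separate strong and weak models?
    spread = np.std(judge_scores) / 10.0   # assumes judge scores on a 0-10 scale
    return 0.5 * corr + 0.5 * spread       # illustrative equal weighting

# Hypothetical per-model averages from the judge and from a reference benchmark.
judge_scores = [6.2, 7.8, 5.1, 8.4, 4.3]
reference_scores = [61.0, 74.5, 48.2, 80.9, 40.1]
print(round(judgemark_sketch(judge_scores, reference_scores), 3))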

Launching the New Benchmarks

To launch individual benchmark tasks:

python eq-bench.py --benchmarks eq-bench

python eq-bench.py --benchmarks creative-writing

python eq-bench.py --benchmarks judgemark

The creative-writing and judgemark tasks require the following parameters to be configured in your config.cfg:

judge_model_api =
judge_model =
judge_model_api_key =
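
For example, a filled-in block might look like this (the model name and key below are placeholders; check the comments in config.cfg for the exact values accepted by judge_model_api):

judge_model_api = anthropic
judge_model = claude-3-opus-20240229
judge_model_api_key = <your-api-key>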

Creative Writing Benchmark Details

Official scores for the creative-writing benchmark use claude-3-opus as judge. However, you can use any openai, mistralai or anthropic model as judge. Be aware that results won't be directly comparable between judge models.

The creative writing benchmark involves 19 writing prompts. The model's output is judged according to 36 criteria for good & bad writing, with each criterion scored 0-10.

Given the small number of questions in the test set, you may wish to run several iterations of the benchmark to reduce variance. You can set n_iterations in the config per benchmark run. We recommend 3+ iterations. Benchmarking a model over 3 iterations using Claude Opus will cost approx. $3.00.

Temperature is set at 0.7 for the test model inference, so output will vary between iterations.


Version 2.2 Released

Changes:

  • Added llama.cpp support
  • Fixed bug with ooba regularly crashing
  • Misc compatibility & bug fixes
  • If using llama.cpp as the inference engine, launch the llama.cpp server first and then run the benchmark; the benchmark will look for the API at the default address of http://127.0.0.1:8080 (see the example below). Multiple benchmark runs are not supported when using llama.cpp.
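
For reference, a typical server launch looks something like this (the binary name and flags vary between llama.cpp versions, and the model path is a placeholder):

./server -m /path/to/model.gguf --host 127.0.0.1 --port 8080

Once the server is up, run the benchmark as usual from another terminal.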

Other news: EQ-Bench has been added to eleuther-eval-harness! Go check it out.

Version 2.1 Released

Changes:

  • Added support for additional languages (just en and de in this release)
  • Added support for custom OpenAI-compatible API endpoints, such as ollama
  • The revision component of the questions is now disabled by default
v2.1 details

DE Support

German language support was kindly added by CrispStrobe. The prompts were translated by GPT-4. You can expect the scores using the German version to be slightly lower than the English version, assuming the model's language competency for each is equal.

Revision Component

After collecting a lot of data from v2, it's clear that the revision component has a mostly negative effect. Only 8% of the time does it improve the score, and on average the revised score is 2.95% lower than the first-pass score. Since we choose the higher of the first-pass and revised aggregate scores, the revision component rarely affects the overall score.

Since revising requires significantly more inference, we have opted to disable it by default. You can still enable it with the -revise argument. The upshot of disabling revision is that the benchmark is now much cheaper and faster to run, and the prompts are a little less complex. This change should have a negligible effect on scores.
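
A minimal sketch of how the per-iteration score is selected when revision is enabled (the function and variable names here are illustrative, not the pipeline's actual code):

from typing import Optional

def iteration_score(first_pass_score: float, revised_score: Optional[float]) -> float:
    # With revision disabled (the default since v2.1), only the first pass counts.
    if revised_score is None:
        return first_pass_score
    # Otherwise the higher of the first-pass and revised aggregate scores is used.
    return max(first_pass_score, revised_score)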

Version 2 Released

V2 of EQ-Bench contains 171 questions (compared to 60 in v1) and a new system for scoring. It is better able to discriminate performance differences between models, and is less subject to variance caused by perturbations (e.g. temp, sampler, quantisation, prompt format, system message). We've also added the ability to upload results to Firebase.

We encourage you to move to v2, and to note which version you are using (EQ-Bench v1 or EQ-Bench v2) when publishing results to avoid confusion.

NOTE: V1 scores are not directly comparable to v2 scores.

More v2 details

Version 2 of the benchmark brings three major changes:

  1. Increased the number of test questions from 60 to 171.
  2. Changed the scoring system from normalised to full scale.
  3. Added the ability to upload results to Firebase.

Known issues:

  • When using oobabooga as the inferencing engine, the API plugin stops responding after approx. 30 queries. The benchmark pipeline handles this by letting the query time out (according to the value set in config.cfg) and then reloading ooba. The cause is unknown at this stage; however, the benchmark should still complete.

Score sensitivity to perturbations

Originally 200 dialogues were generated for the test set, of which the 60 best (most coherent & challenging) were selected for v1 of the benchmark. We initially established very low variance between runs of the v1 benchmark when holding all parameters constant. However, it has become apparent that minor perturbations to the model or inferencing parameters can cause score variance beyond what is explained by the actual change in performance.

Traditional multiple choice tests are less prone to this kind of variance because these perturbations are unlikely to change an answer from "A" to "B". In contrast, EQ-Bench questions require a subjective prediction of emotional intensity on a range of 0-10. Small perturbations to the model or inferencing params can produce significantly different numerical predictions. This is a source of noise that can be mitigated by increasing the number of questions. So for v2 we opted to expand the test set to 171 out of the originally generated 200.

We tested v1 against v2 for a number of models, while controlling a range of parameters (temp, sampler, quantisation, prompt format, system message). We find v2 scores to be significantly more stable to perturbations to these variables, and so we expect the scores to be more closely representative of the true performance of the model.

Scoring system changes

In v1 of EQ-Bench we elected to normalise the four emotional intensity ratings in each question to sum to 10. The reasoning for this was that different subjects might have different ideas about, for example, what constitutes a 10 rating. Given the subjectivity here, multiple perspectives can be valid.

A systematic bias in how the subject rates emotional intensity might correlate with a similar systematic bias in the creators of the reference answers, resulting in an artificially inflated score. So to eliminate this issue we normalised both the reference answer and the subject answer so that we are only comparing the relative intensity of each emotion.

This seemed like a good idea at the time; however, normalising in this way is far from a perfect solution. It handles certain edge cases poorly, and several models benchmarked with numbers that were significant outliers compared to other major benchmarks (e.g. Mixtral 8x7B produced unusually low scores). In addition, normalising the answers means we lose the ability to assess the model's ability to make reasonable predictions about the absolute intensity of emotions.

In v2 we opted for a different approach: we still calculate the score by computing the difference from the reference answer; however, we no longer normalise the values. To mitigate the subjective nature of rating emotional intensity, we scale down smaller differences (differences of 1-4 from the reference) on a curve, while differences of 5 to 10 are counted 1:1.

The result of these changes is better discriminative ability of the benchmark, and generally slightly higher scores compared to v1. As with v1, the score baseline is calibrated so that a score of 0 corresponds to answering randomly, and a score of 100 matches the reference answers exactly.
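
To make the full-scale scoring idea concrete, here is a minimal sketch (the exact curve and calibration used by the benchmark live in the source; the curve exponent and the emotion ratings below are assumptions for illustration):

def scaled_difference(predicted: float, reference: float) -> float:
    # Differences of 5-10 from the reference count 1:1; smaller differences
    # are scaled down on a curve (the exponent here is a made-up example).
    d = abs(predicted - reference)
    if d >= 5:
        return d
    return 5 * (d / 5) ** 1.5

def question_penalty(predicted: dict, reference: dict) -> float:
    # Total scaled difference across the four emotion ratings of one question.
    return sum(scaled_difference(predicted[e], reference[e]) for e in reference)

# Example: four emotions rated 0-10 by the test model vs. the reference answer.
reference = {"surprise": 7, "anger": 2, "relief": 0, "embarrassment": 6}
predicted = {"surprise": 6, "anger": 3, "relief": 0, "embarrassment": 9}
print(question_penalty(predicted, reference))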

Version 1.1 Released

This version adds support for Oobabooga. The benchmark pipeline can automatically download each model, launch the model with ooba using the specified parameters, and close the ooba server after the run completes, optionally deleting the model files.

Requirements

  • Linux
  • Python 3.x
  • Working install of Oobabooga (optional)
  • Sufficient GPU / System RAM to load the models
  • Python libraries listed in install_reqs.sh

EQ-bench requirements

  • tqdm
  • sentencepiece
  • hf_transfer
  • openai
  • scipy
  • torch
  • peft
  • bitsandbytes
  • transformers (preferably the latest version installed directly from GitHub: huggingface/transformers)
  • trl
  • accelerate
  • tensorboardX
  • huggingface_hub

Requirements for QWEN models

  • einops
  • transformers_stream_generator (version 0.0.4)
  • deepspeed
  • tiktoken
  • flash-attention (the latest version installed directly from GitHub: Dao-AILab/flash-attention)
  • auto-gptq
  • optimum

Requirements for uploading results

  • gspread
  • oauth2client
  • firebase_admin

Installation

Quick start:

If you are installing EQ-Bench on a fresh Linux install (such as a RunPod instance or similar), you can run ooba_quick_install.sh. This will install oobabooga into the current user's home directory, install all EQ-Bench dependencies, and then run the benchmark pipeline.
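
For example, on a fresh machine (the clone URL is the public EQ-Bench repository; invoke the scripts with bash if they are not marked executable):

git clone https://github.com/EQ-bench/EQ-Bench.git
cd EQ-Bench
bash ooba_quick_install.sh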

Install (not using quick start)

Note: Oobabooga is optional. If you prefer to use transformers as the inference engine, or if you are only benchmarking through the OpenAI API, you can skip installing it.

  • Install the required Python dependencies by running install_reqs.sh.
  • Optional: install the Oobabooga library and make sure it launches.
  • Optional: Set up firebase / firestore for results upload (see instructions below).
  • Optional: Set up Google Sheets for results upload (see instructions below).

Configure

  • Set up config.cfg with your API keys and runtime settings.
  • Add benchmark runs to config.cfg, in the format:
    • run_id, instruction_template, model_path, lora_path, quantization, n_iterations, inference_engine, ooba_params, downloader_filters

      • run_id: A name to identify the benchmark run
      • instruction_template: The filename of the instruction template defining the prompt format, minus the .yaml (e.g. Alpaca)
      • model_path: Huggingface model ID, local path, or OpenAI model name
      • lora_path (optional): Path to local lora adapter
      • quantization: Using bitsandbytes package (8bit, 4bit, None)
      • n_iterations: Number of benchmark iterations (final score will be an average)
      • inference_engine: Set this to transformers, openai, ooba or llama.cpp.
      • ooba_params (optional): Any additional ooba params for loading this model (overrides the global setting above)
      • downloader_filters (optional): Specify --include or --exclude patterns (using same syntax as huggingface-cli download)

Benchmark run examples

# run_id, instruction_template, model_path, lora_path, quantization, n_iterations, inference_engine, ooba_params, downloader_filters

myrun1, openai_api, gpt-4-0613, , , 1, openai, ,

myrun2, Llama-v2, meta-llama/Llama-2-7b-chat-hf, /path/to/local/lora/adapter, 8bit, 3, transformers, , ,

myrun3, Alpaca, ~/my_local_model, , None, 1, ooba, --loader transformers --n_ctx 1024 --n-gpu-layers -1,

myrun4, Mistral, TheBloke/Mistral-7B-Instruct-v0.2-GGUF, , None, 1, ooba, --loader llama.cpp --n-gpu-layers -1 --tensor_split 1,3,5,7, --include ["*Q3_K_M.gguf", "*.json"]

myrun5, Mistral, mistralai/Mistral-7B-Instruct-v0.2, , None, 1, ooba, --loader transformers --gpu-memory 12, --exclude "*.bin"

myrun6, ChatML, model_name, , None, 1, llama.cpp, None,

Running the benchmark

  • Run the benchmark:
    • python3 eq-bench.py --benchmarks <eq-bench|creative-writing|judgemark>
  • Results are saved to benchmark_results.csv

Script Options

  • -h: Displays help.
  • -w: Overwrites existing results (i.e., disables the default behaviour of resuming a partially completed run).
  • -d: Downloaded models will be deleted after each benchmark successfully completes. Does not affect previously downloaded models specified with a local path.
  • -f: Use hftransfer for multithreaded downloading of models (faster but can be unreliable).
  • -v: Display more verbose output.
  • -r: Set the number of retries to attempt if a benchmark run fails. Default is 5.
  • -l: Sets the language: en and de currently supported. Defaults to English if not specified.
  • -v1: Runs v1 of the benchmark (legacy). If not set, the benchmark defaults to v2.
  • -revise: Enables the revision component of the test questions (this is off by default since v2.1).
  • --benchmarks: Specify the benchmarks to run (separate by comma): <eq-bench|creative-writing|judgemark>
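
For example, to run the eq-bench task in German with verbose output, overwriting any existing results:

python3 eq-bench.py --benchmarks eq-bench -l de -v -w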

Prompt Formats / Instruction Templates

EQ-Bench uses the same instruction template format as the Oobabooga library. You can modify the existing ones or add your own. When you specify a prompt format in config.cfg, use the filename minus the .yaml, e.g. Alpaca.

  • If using transformers as the inference engine, the benchmark pipeline uses templates located in [EQ-Bench dir]/instruction-templates.
  • If using ooba as the inference engine, the pipeline uses templates located in [ooba dir]/instruction-templates

When using transformers, if you leave the prompt format blank in config.cfg, transformers will apply the chat template in the tokenizer if there is one.

When using ooba, if you leave the prompt format blank in config.cfg, ooba will make its best guess as to what the prompt format should be.
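
For reference, the transformers fallback corresponds roughly to the standard Hugging Face chat-template call (a sketch only; the model id is reused from the examples above and the pipeline's actual code may differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
messages = [{"role": "user", "content": "How are you feeling today?"}]

# When no instruction template is specified, the tokenizer's own chat template
# (if present) is used to format the prompt.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)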

Setting up Firebase / Firestore for Results Uploading (Optional)

  1. Create a new firebase project.
  2. Create a service account within this project.
  3. Generate a new private key and save it as firebase_creds.json in the EQ-Bench root directory.
  4. Create a default firestore database in the project.

When EQ-Bench sees firebase_creds.json in the EQ-Bench directory, it will upload results to this Firestore database when a benchmark run completes.
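
For reference, the upload corresponds roughly to the following firebase_admin calls (the collection name is a hypothetical placeholder, and the pipeline's actual code may differ):

import firebase_admin
from firebase_admin import credentials, firestore

# Authenticate with the service-account key generated in step 3.
cred = credentials.Certificate("firebase_creds.json")
firebase_admin.initialize_app(cred)

db = firestore.client()
# "benchmark_results" is a placeholder collection name for illustration.
db.collection("benchmark_results").add({"run_id": "myrun1", "score": 75.2})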

Setting up Google Sheets for Results Uploading (Optional)

  1. Create a new Google Sheet.
  2. Set the share settings so that anyone with the link can edit.
  3. Set google_spreadsheet_url in config.cfg to the URL of the sheet you just created.
  4. Go to Google Cloud Console.
  5. Create a new project and ensure it is selected as active in the dropdown at the top of the page.
  6. Enable the Google Sheets API for the project:
    • In the search bar, type "sheets"
    • Click Google Sheets API
    • Click Enable
  7. Create a service account:
    • In the search bar, type "Service accounts" and click the appropriate result
    • Click + Create Service Account
    • Give it a name & id, then click Create and continue
    • Grant this service account access: Basic -> Editor
    • Click Done
  8. Click on the service account, then navigate to Keys -> Add key -> Create new key -> JSON.
  9. Save the file to google_creds.json in the eq-bench directory.
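
For reference, appending a result row with gspread looks roughly like this (the sheet URL and row contents are placeholders; the pipeline may authenticate via oauth2client instead, and its actual code may differ):

import gspread

# Authenticate with the service-account key saved as google_creds.json.
gc = gspread.service_account(filename="google_creds.json")

# Use the value of google_spreadsheet_url from config.cfg here.
sheet = gc.open_by_url("https://docs.google.com/spreadsheets/d/<your-sheet-id>")
sheet.sheet1.append_row(["myrun1", "2024-06-29", 75.2])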

Cite

@misc{paech2023eqbench,
      title={EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models}, 
      author={Samuel J. Paech},
      year={2023},
      eprint={2312.06281},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

eq-bench's Issues

Offload to the cpu

With this config:

openchat-gemma, , openchat/openchat-3.5-0106-gemma, , , 1, transformers, , ,
Nous-Hermes-2-SOLAR-10.7B, , NousResearch/Nous-Hermes-2-SOLAR-10.7B, , , 1, transformers, , ,

I get a warning for the second model ("WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.") and evaluation is very slow.

I am running it on a GPU with 40GB.

If only Nous-Hermes-2-SOLAR-10.7B is in the config then everything is fine. I guess the previous model is not removed before loading the next one: I see the del model in the cleanup code, but in practice it does nothing.

Backend changes scores significantly

I've been using ExllamaV2 to test miqu-1-70b-sf-4.0bpw-h6-exl2 and I get a score of 82.71, so on par with that listed on the leaderboard at https://eqbench.com/.

However, I just tried running the original GGUF model at miqu-1-70b using the server from llama.cpp, and I only scored 75.26!

That's significantly lower, so I'm wondering where the error could be.

Maybe it's related to this finding?
https://www.reddit.com/r/LocalLLaMA/comments/1b1gxmq/the_definite_correct_miqu_prompt/

UPDATE: Nope, I modified the chat template, and got a score of 75.29
UPDATE 2: Just tested the original 5-bit miqu-1-70b.q5_K_M.gguf, and that also only scored 75.50, so it's not a quantization issue.
UPDATE 3: Retested miqu-1-70b-sf-4.0bpw-h6-exl2 on the same machine with a fresh install and got 81.75, so the Exllama version scores much higher than the Llama.cpp version, even though the Llama.cpp version is the original leaked version! That seems very odd, as the process of de-quantizing, changing model format and re-quantizing shouldn't lead to much higher EQ-Bench scores!

'BitsAndBytesConfig' object has no attribute 'get_loading_attributes'

The parameters which I pass in config.cfg file is:
myrun2, ChatML, TheBloke/WestLake-7B-v2-GPTQ, , 4bit, 1, transformers, , ,

I am using the model ID "TheBloke/WestLake-7B-v2-GPTQ". It is a GPTQ model, and I hit the above error when trying to compute its score on EQ-Bench. Can EQ-Bench score a GPTQ model such as this, and if so, how can I calculate its EQ-Bench score? How can I solve this error?


Contributing with other judges

I'd like to contribute results to the Creative Writing benchmark.

Since I live in the EU, I do not have access to the Claude API. I am currently running it with Mixtral 8x22B as the judge via the Mistral API.

Can I contribute those results? Also, is there any tutorial on how to share the results? Do I just make a PR? 🙂

Contributing to OpenCompass

First of all, thank you for your high-quality open-source project. We are very interested in your EQ-Bench and creative writing benchmarks, as they share many similarities with our subjective evaluations in OpenCompass. We would like to know if you would be willing to integrate these two benchmarks into OpenCompass to enable a more diverse range of evaluations.
Here is the link to OpenCompass: https://github.com/open-compass/opencompass
And here is a demo for subjective evaluation in OpenCompass: https://github.com/open-compass/opencompass/blob/main/configs/eval_subjective_alignbench.py

The prompt to generate the dialogue.

Dear authors,
Could you give us some examples of how the dialogues were generated using GPT-4 or other LLMs?
We cannot find the details in the paper or on the website.
Thank you all so much for your help.

Add Claude

Claude 2.1 and Claude Instant would be great to benchmark 🙏

Support for Seq2Seq LMs

How can I run the local flan-t5 model on this benchmark? I found that oobabooga already supports Seq2Seq models: Add support for Seq2Seq LMs
I tried testing with the Alpaca instruction template, but didn't get any output.

# config.cfg
[Oobabooga config]
ooba_launch_script = ~/text-generation-webui/start_linux.sh
ooba_params_global = 
automatically_launch_ooba = true
ooba_request_timeout = 120

[Benchmarks to run]
run-t5, None, ~/text-generation-webui/models/flan-t5-xl, , None, 1, ooba, --loader transformers --n_ctx 1024 --n-gpu-layers -1, 
# benchmark_results.csv
run-t5,2024-01-26 16:32:07,Alpaca,/home/user/text-generation-webui/models/flan-t5-xl,,none,FAILED,FAILED,FAILED,1,ooba,,,0.0 questions were parseable (min is 83%)
# raw_results
    "run-t5--v2--/home/user/text-generation-webui/models/flan-t5-xl----Alpaca--none--ooba----": {
        "run_metadata": {
            "run_id": "run-t5",
            "eq_bench_version": "v2",
            "instruction_template": "Alpaca",
            "model_path": "/home/user/text-generation-webui/models/flan-t5-xl",
            "lora_path": "",
            "bitsandbytes_quant": "none",
            "total_iterations": 1,
            "inference_engine": "ooba",
            "ooba_params": "",
            "include_patterns": [],
            "exclude_patterns": []
        },
        "iterations": {
            "1": {
                "respondent_answers": {},
                "individual_scores": {},
                "individual_scores_fullscale": {},
                "raw_inference": {},
                "benchmark_results_fullscale": {
                    "first_pass_score": 0,
                    "first_pass_parseable": 0,
                    "revised_score": 0,
                    "revised_parseable": 0,
                    "final_score": 0,
                    "final_parseable": 0
                }
            }
        }
    }

I would be very grateful for your help.

model test request

Now that the llama.cpp server is running correctly, would it be possible to have this model tested?

https://huggingface.co/Infinimol/miiqu-gguf
using ChatML format and context length >= 1024, please :)

It is a model I've been working on for some time, and I think it's interesting. It is not a fine-tune but a merge, and I find it consistently scores higher than the base model (miqu), which I think is a first for a pure merge model. EQ-Bench runs in about 15 mins on an A100.

The model is GGUF, but split to fit under the 50GB limit on Hugging Face; the model card gives the one-liner to reassemble the file.

Add some of the new 100B+ models to the leaderboard

It would be really cool to see how some of the more recent megamerges stack up against the rest of the competition, especially those which build on top of miqu-1-70b. Specifically, I think it would be interesting to benchmark the EQ of the following models:

And if it's not way too big to benchmark...

  • TheProfessor-155b (cognitivecomputations/dolphin-2.2-70b + WizardLM/WizardMath-70B-V1.0 + migtissera/SynthIA-70B-v1.2b + epfl-llm/meditron-70b)

Passing in model_kwargs

Hey there, I wanted some help running eq-bench. specifically i wanted to test different sampler settings (min_p). however i noticed in the configs that there's no way to specify model_kwargs / gen_kwargs, am I right? I'm not familiar with ooba, so maybe they're passed in that way? In that case I'd have to make a fork to pass in the configs. If it's useful maybe i can make a PR. Is my understanding correct?

Trying to get to the bottom of why `Qwen1.5-110B-Chat` scores so much higher than the `command-r` models

https://github.com/EQ-bench/EQ-Bench/blob/main_v2_4/instruction-templates/Cohere.yaml

user: "<|USER_TOKEN|>"
bot: "<|CHATBOT_TOKEN|>"
turn_template: "<|START_OF_TURN_TOKEN|><|user|><|user-message|><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|bot|><|bot-message|>"
context: "<BOS_TOKEN>"
system_message: ""

Is there a chance that the <BOS_TOKEN> was getting added twice during the tests? See:

https://huggingface.co/CohereForAI/c4ai-command-r-plus/discussions/22#66179da37ed574892089967c

Not sure which backend was used to test against, but some will add it automatically, and the HF config looks to have it in both places:

  "add_bos_token": true

and:

"chat_template": [
    {
      "name": "default",
      "template": "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true %}{% set loop_messages = messages %}{% set system_message = 'You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses. You are trained by Cohere.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'assistant' %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}{% endif %}"
    },
    {
      "name": "tool_use",
      "template": "\n{%- macro json_to_python_type(json_spec) %}\n{%- set basic_type_map = {\n    \"string\": \"str\",\n    \"number\": \"float\",\n    \"integer\": \"int\",\n    \"boolean\": \"bool\"\n} %}\n\n{%- if basic_type_map[json_spec.type] is defined %}\n    {{- basic_type_map[json_spec.type] }}\n{%- elif json_spec.type == \"array\" %}\n    {{- \"List[\" +  json_to_python_type(json_spec.items) + \"]\"}}\n{%- elif json_spec.type == \"object\" %}\n    {{- \"Dict[str, \" + json_to_python_type(json_spec.additionalProperties) + ']'}}\n{%- elif json_spec.type is iterable %}\n    {{- \"Union[\" }}\n    {%- for t in json_spec.type %}\n      {{- json_to_python_type({\"type\": t}) }}\n      {%- if not loop.last %}\n        {{- \",\" }} \n    {%- endif %}\n    {%- endfor %}\n    {{- \"]\" }}\n{%- else %}\n    {{- \"Any\" }}\n{%- endif %}\n{%- endmacro %}\n\n{%- macro old_tool_parser(tools) %}\n{%- for tool in tools %}\n    {%- if loop.index0 != 0 %}\n        {{- '\\n\\n' }}\n    {%- endif %}\n    {{- '```python\\ndef ' + tool.name + '(' }}\n    {%- for param_name, param_fields in tool.parameter_definitions|items %}\n        {%- if loop.index0 != 0 %}\n            {{- ', '}}\n        {%- endif %}\n        {{- param_name + ': ' }}\n        {%- if not param_fields.required %}\n            {{- 'Optional[' + param_fields.type + '] = None'}}\n        {%- else %}\n            {{- param_fields.type }}\n        {%- endif %}\n    {%- endfor %}\n    {{- ') -> List[Dict]:\\n    \"\"\"'}}\n    {{- tool.description }}\n    {%- if tool.parameter_definitions|length != 0 %}\n        {{- '\\n\\n    Args:\\n        '}}\n        {%- for param_name, param_fields in tool.parameter_definitions|items %}\n            {%- if loop.index0 != 0 %}\n                {{- '\\n        ' }}\n            {%- endif %}\n            {{- param_name + ' ('}}\n            {%- if not param_fields.required %}\n                {{- 'Optional[' + param_fields.type + ']'}}\n            {%- else %}\n                {{- param_fields.type }}\n            {%- endif %}\n            {{- '): ' + param_fields.description }}\n        {%- endfor %}\n    {%- endif %}\n    {{- '\\n    \"\"\"\\n    pass\\n```' }}\n{%- endfor %}\n{%- endmacro %}\n\n{%- macro new_tool_parser(tools) %}\n{%- for tool in tools %}\n  {%- if loop.index0 != 0 %}\n    {{- '\\n\\n'}}\n  {%- endif %}\n  {%- if tool.function is defined %}\n    {%- set tool = tool.function %}\n  {%- endif %}\n  {{-'```python\ndef ' + tool.name + '('}}\n  {%- for param_name, param_fields in tool.parameters.properties|items %}\n    {%- if loop.index0 != 0 %}\n      {{- ', '}}\n    {%- endif %}\n    {{-param_name + \": \"}} \n    {%- if not param_name in tool.parameters.required %}\n      {{-'Optional[' + json_to_python_type(param_fields) + '] = None'}}\n    {%- else %}\n      {{- json_to_python_type(param_fields) }}\n    {%- endif %}\n  {%- endfor %}\n  {{- ') -> List[Dict]:\n    \"\"\"'}}\n  {{- tool.description }}\n  {%- if tool.parameters.properties|length != 0 %}\n    {{- '\\n\\n    Args:\\n        '}}\n    {%- for param_name, param_fields in tool.parameters.properties|items %}\n      {%- if loop.index0 != 0 %}\n        {{- '\\n        ' }}\n      {%- endif %}\n      {{- param_name + ' ('}}\n      {%- if not param_name in tool.parameters.required %}\n        {{-'Optional[' + json_to_python_type(param_fields) + ']'}}\n      {%- else %}\n        {{- json_to_python_type(param_fields) }}\n      {%- endif %}\n      {{- '): ' + param_fields.description }}\n    {%- endfor %}\n    {%- endif %}\n    {{- 
'\\n    \"\"\"\\n    pass\\n```' }}\n{%- endfor %}\n{%- endmacro %}\n\n{{- bos_token }}\n{%- if messages[0]['role'] == 'system' %}\n  {%- set loop_messages = messages[1:] %}\n  {%- set system_message = messages[0]['content'] %}\n{%- else %}\n  {%- set loop_messages = messages %}\n  {%- set system_message = '## Task and Context\\nYou help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user\\'s needs as best you can, which will be wide-ranging.\\n\\n## Style Guide\\nUnless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.' %}\n{%- endif %}\n{{- '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}\n{{- '# Safety Preamble' }}\n{{- '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}\n{{- '\n\n# System Preamble' }}\n{{- '\n## Basic Rules' }}\n{{- '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}\n{{- '\n\n# User Preamble' }}\n{{- '\n' + system_message }}\n{{-'\n\n## Available Tools\nHere is a list of tools that you have available to you:\n\n'}}\n{%- set ns = namespace(new_tools=true) %}\n{%- for tool in tools %}\n    {%- if tool.parameter_definitions is defined %}\n        {%- set ns.new_tools = false %}\n    {%- endif %}\n{%- endfor %}\n{%- if ns.new_tools %}\n    {{- new_tool_parser(tools) }}\n{%- else %}\n    {{- old_tool_parser(tools) }}\n{%- endif %}\n{{- '<|END_OF_TURN_TOKEN|>'}}\n{%- for message in loop_messages %}\n  {%- set content = message['content'] %}\n  {%- if message.role == 'user' %}\n    {{- '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content|trim + '<|END_OF_TURN_TOKEN|>' }}\n  {%- elif message.role == 'system' %}\n    {{- '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content|trim + '<|END_OF_TURN_TOKEN|>' }}\n  {%- elif message.role == 'assistant' and message.tool_calls is defined %}\n    {{- '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}\n    {%- if message.content is defined %}\n        {{- message.content|trim }}\n    {%- endif %}\n    {{- '\\nAction:\\n```json\\n[\\n' }}\n    {%- for tool_call in message.tool_calls %}\n        {%- if tool_call.function is defined %}\n            {%- set tool_call = tool_call.function %}\n        {%- endif %}\n        {{- '{\\n'|indent(4, first=true) }}\n        {{- '\"tool_name\": \"'|indent(8, first=true) + tool_call.name + '\",\\n' }}\n        {{- '\"parameters\": '|indent(8, first=true) }}\n        {%- if tool_call.arguments is defined and tool_call.arguments|length > 0 %}    \n            {{- tool_call.arguments|tojson(indent=4)|indent(8) }}\n            {{- '\\n' }}\n        {%- else %}\n            {{- '{}\\n' }}\n        {%- endif %}\n        {{- '}'|indent(4, first=true) }}\n        {%- if not loop.last %}\n            {{- ',\\n' }}\n        {%- endif %}\n    {%- endfor %}\n    
{{- \"\\n]```\\n\" }}\n  {%- elif message.role == 'assistant' %}\n    {{- '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content|trim + '<|END_OF_TURN_TOKEN|>' }}\n  {%- elif message.role == 'tool' %}\n    {{- '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|><results>\\n' }}\n    {{- message.content|trim }}\n    {{- '</results><|END_OF_TURN_TOKEN|>' }}\n  {%- endif %}\n{%- endfor %}\n{{-'<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write \\'Action:\\' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user\\'s last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:\n```json\n[\n    {\n        \"tool_name\": title of the tool in the specification,\n        \"parameters\": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters\n    }\n]```<|END_OF_TURN_TOKEN|>'}}\n{%- if add_generation_prompt %}\n  {{- '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}\n{%- endif %}\n"
    },
    {
      "name": "rag",
      "template": "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = '## Task and Context\\nYou help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user\\'s needs as best you can, which will be wide-ranging.\\n\\n## Style Guide\\nUnless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.' %}{% endif %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}{{ '# Safety Preamble' }}{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}{{ '\n\n# System Preamble' }}{{ '\n## Basic Rules' }}{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}{{ '\n\n# User Preamble' }}{{ '\n' + system_message }}{{ '<|END_OF_TURN_TOKEN|>'}}{% for message in loop_messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'system' %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'assistant' %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% endfor %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>'}}{{ '<results>' }}{% for document in documents %}{{ '\nDocument: ' }}{{ loop.index0 }}\n{% for key, value in document.items() %}{{ key }}: {{value}}\n{% endfor %}{% endfor %}{{ '</results>'}}{{ '<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}{{ 'Carefully perform the following instructions, in order, starting each with a new line.\n' }}{{ 'Firstly, Decide which of the retrieved documents are relevant to the user\\'s last input by writing \\'Relevant Documents:\\' followed by comma-separated list of document numbers. If none are relevant, you should instead write \\'None\\'.\n' }}{{ 'Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user\\'s last input by writing \\'Cited Documents:\\' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write \\'None\\'.\n' }}{% if citation_mode=='accurate' %}{{ 'Thirdly, Write \\'Answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.\n' }}{% endif %}{{ 'Finally, Write \\'Grounded answer:\\' followed by a response to the user\\'s last input in high quality natural english. 
Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.' }}{{ '<|END_OF_TURN_TOKEN|>' }}{% if add_generation_prompt %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}{% endif %}"
    }
  ]

The only reason I ask is that I've been running some similar tests and I can't understand how Qwen1.5-110B-Chat scores so highly while both command-r models score much lower.

I've tried all the qwen-2 and qwen-1.5 models and they all seem to be really bad at "in the style of" prompts, which a lot of your benchmark prompts look to ask for.

I'd love to find out exactly what was getting sent as the template for qwen and command-r to see if I can get to the bottom of what is happening!

Just checked the ChatML template and it is a little odd too:

https://github.com/EQ-bench/EQ-Bench/blob/main_v2_4/instruction-templates/ChatML.yaml

user: user
bot: assistant
turn_template: <|im_start|><|user|>\n<|user-message|><|im_end|>\n<|im_start|><|bot|>\n<|bot-message|><|im_end|>\n
context: |
  <|im_start|>system
  <|system-message|><|im_end|>

Not sure why there is an extra pipe, newline and spaces like that?

Input length of input_ids is 1211, but `max_length` is set to 1000. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`. Benchmark run failed

In the config.cfg file I use the parameters:
[Benchmarks to run]
myrun1, Llama-v2, meta-llama/Llama-2-7b-chat-hf, , 8bit, 1, transformers, , ,

It shows the error:
Failed to parse scores
99% 170/171 [01:18<00:00, 2.17it/s]
Input length of input_ids is 1211, but max_length is set to 1000. This can lead to unexpected behavior. You should consider increasing max_length or, better yet, setting max_new_tokens.
Benchmark run failed.

I think it is due to the large number of input tokens in questions 170 or 171. How can I set max_length > 1200?
Please update the code or suggest a solution to this error.

Benchmark Failed

I am trying to benchmark new models, e.g.:
glm-4-9b-chat, , THUDM/glm-4-9b-chat, , , 1, transformers, , ,

python eq-bench.py --benchmarks eq-bench -v -r 1

...
Running benchmark 1 of 1

THUDM/glm-4-9b-chat
--------------
Iteration 1 of 1
A new version of the following files was downloaded from https://huggingface.co/THUDM/glm-4-9b-chat:
- tokenization_chatglm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
A new version of the following files was downloaded from https://huggingface.co/THUDM/glm-4-9b-chat:
- configuration_chatglm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/THUDM/glm-4-9b-chat:
- modeling_chatglm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading shards: 100%|██████████| 10/10 [00:00<00:00, 5558.31it/s]
Loading checkpoint shards: 100%|██████████| 10/10 [00:12<00:00,  1.26s/it]
  0%|                                                                                                                                                           | 0/171 [00:00<?, ?it/s]

eq-bench benchmark run failed.
Retrying 1 of 1
Iteration 1 of 1
A new version of the following files was downloaded from https://huggingface.co/THUDM/glm-4-9b-chat:
- tokenization_chatglm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
A new version of the following files was downloaded from https://huggingface.co/THUDM/glm-4-9b-chat:
- configuration_chatglm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/THUDM/glm-4-9b-chat:
- modeling_chatglm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading shards: 100%|██████████| 10/10 [00:00<00:00, 5711.98it/s]
Loading checkpoint shards: 100%|██████████| 10/10 [00:13<00:00,  1.35s/it]
  0%|                                                                                                                                                           | 0/171 [00:00<?, ?it/s]

eq-bench benchmark run failed.
! eq-bench Benchmark Failed
