lighteval's Introduction

LightEval 🌀️

A lightweight framework for LLM evaluation

Context

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

We're releasing it with the community in the spirit of building in the open.

Note that it is still very much in its early days, so don't expect 100% stability ^^' In case of problems or questions, feel free to open an issue!

Installation

Clone the repo:

git clone https://github.com/huggingface/lighteval.git
cd lighteval

Create a virtual environment using virtualenv or conda depending on your preferences. We require Python 3.10 or above:

conda create -n lighteval python=3.10 && conda activate lighteval

Install the dependencies. For the default installation, you just need:

pip install .

If you want to evaluate models with frameworks like accelerate or peft, you will need to specify the optional dependency group that fits your use case (accelerate, tgi, optimum, quantization, adapters, nanotron):

pip install '.[optional1,optional2]'

The most thoroughly tested setup is:

pip install '.[accelerate,quantization,adapters]'

If you want to push your results to the Hugging Face Hub, don't forget to add your access token to the environment variable HUGGING_FACE_HUB_TOKEN. You can do this by running:

huggingface-cli login

and pasting your access token.

Optional steps

  • To load and push big models/datasets, your machine likely needs Git LFS. You can install it with sudo apt-get install git-lfs
  • If you want to run bigbench evaluations, install bigbench with pip install "bigbench@https://storage.googleapis.com/public_research_data/bigbench/bigbench-0.0.1.tar.gz"

Lastly, if you intend to contribute to the code base, you'll need to install the pre-commit hooks for the styling checks:

pip install '.[dev]'
pre-commit install

Usage

We provide two main entry points to evaluate models:

  • run_evals_accelerate.py: evaluate models on CPU, or on one or more GPUs, using the 🤗 Accelerate backend
  • run_evals_nanotron.py: evaluate models trained with nanotron in distributed settings

For most users, we recommend using the 🤗 Accelerate backend - see below for specific commands.

Evaluate a model on one or more GPUs (recommended)

To evaluate a model on one or more GPUs, first create a multi-gpu config by running:

accelerate config

You can then evaluate a model using data parallelism as follows:

accelerate launch --multi_gpu --num_processes=<num_gpus> run_evals_accelerate.py \
    --model_args="pretrained=<path to model on the hub>" \
    --tasks <task parameters> \
    --output_dir output_dir

Here, --tasks refers to either a comma-separated list of supported tasks from the metadata table in the format:

suite|task|num_few_shot|{0 or 1 to automatically reduce `num_few_shot` if prompt is too long}

or a file path like examples/tasks/recommended_set.txt which specifies multiple task configurations. For example, to evaluate GPT-2 on the Truthful QA benchmark run:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --model_args "pretrained=gpt2" \
    --tasks "lighteval|truthfulqa:mc|0|0" \
    --override_batch_size 1 \
    --output_dir="./evals/"

Here, --override_batch_size defines the batch size per device, so the effective batch size will be override_batch_size x num_gpus. To evaluate on multiple benchmarks, separate each task configuration with a comma, e.g.

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --model_args "pretrained=gpt2" \
    --tasks "leaderboard|truthfulqa:mc|0|0,leaderboard|gsm8k|0|0" \
    --override_batch_size 1 \
    --output_dir="./evals/"

See the examples/tasks/recommended_set.txt file for a list of recommended task configurations.

Evaluating a model with a complex configuration

If you want to evaluate a model by spinning up inference endpoints, or use adapter/delta weights, or more complex configuration options, you can load models using a configuration file. This is done as follows:

accelerate launch --multi_gpu --num_processes=<num_gpus> run_evals_accelerate.py \
    --model_config_path="<path to your model configuration>" \
    --tasks <task parameters> \
    --output_dir output_dir

Examples of possible configuration files are provided in examples/model_configs.

Evaluating a large model with pipeline parallelism

To evaluate models larger than ~40B parameters in 16-bit precision, you will need to shard the model across multiple GPUs to fit it in VRAM. You can do this by passing model_parallel=True and setting --num_processes to the number of processes to use for data parallelism. For example, on a single node of 8 GPUs, you can run:

# PP=2, DP=4 - good for models < 70B params
accelerate launch --multi_gpu --num_processes=4 run_evals_accelerate.py \
    --model_args="pretrained=<path to model on the hub>,model_parallel=True" \
    --tasks <task parameters> \
    --output_dir output_dir

# PP=4, DP=2 - good for huge models >= 70B params
accelerate launch --multi_gpu --num_processes=2 run_evals_accelerate.py \
    --model_args="pretrained=<path to model on the hub>,model_parallel=True" \
    --tasks <task parameters> \
    --output_dir output_dir
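
In other words, --num_processes fixes the data-parallel degree and the remaining GPUs on the node are used to shard each model replica. A rough sketch of the arithmetic, assuming a single node (a hypothetical helper, not part of lighteval):

def pipeline_parallel_degree(gpus_per_node: int, num_processes: int) -> int:
    """How many GPUs each model replica gets once the data-parallel degree is fixed."""
    assert gpus_per_node % num_processes == 0, "the DP degree must divide the GPU count"
    return gpus_per_node // num_processes

print(pipeline_parallel_degree(8, 4))  # -> 2, matching the PP=2, DP=4 example above
print(pipeline_parallel_degree(8, 2))  # -> 4, matching the PP=4, DP=2 example above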

Evaluate a model on the Open LLM Leaderboard benchmarks

To evaluate a model on all the benchmarks of the Open LLM Leaderboard using a single node of 8 GPUs, run:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --model_args "pretrained=<model name>" \
    --tasks examples/tasks/open_llm_leaderboard_tasks.txt \
    --override_batch_size 1 \
    --output_dir="./evals/"

Evaluate a model on CPU

You can also use lighteval to evaluate models on CPU, although note this will typically be very slow for large models. To do so, run:

python run_evals_accelerate.py \
    --model_args="pretrained=<path to model on the hub>"\
    --tasks <task parameters> \
    --output_dir output_dir

Evaluate a model on extended, community, or custom tasks

Independently of the default tasks provided in lighteval that you will find in the tasks_table.jsonl file, you can use lighteval to evaluate models on tasks that require special processing (or have been added by the community). These tasks have their own evaluation suites and are defined as follows:

  • extended: tasks which have complex pre- or post-processing and are added by the lighteval maintainers. See the extended_tasks folder for examples.
  • community: tasks which have been added by the community. See the community_tasks folder for examples.
  • custom: tasks which are defined locally and not present in the core library. Use this suite if you want to experiment with designing a special metric or task.

For example, to run an extended task like ifeval (the --use_chat_template flag is optional; pass it if you want to run the evaluation with the model's chat template), you can run:

python run_evals_accelerate.py \
    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
    --use_chat_template \
    --tasks "extended|ifeval|0|0" \
    --output_dir "./evals"

To run a community or custom task, you can use (note the custom_tasks flag):

python run_evals_accelerate.py \
    --model_args="pretrained=<path to model on the hub>"\
    --tasks <task parameters> \
    --custom_tasks <path to your custom or community task> \
    --output_dir output_dir

For example, to launch lighteval on arabic_mmlu:abstract_algebra for HuggingFaceH4/zephyr-7b-beta (again, --use_chat_template is optional), run:

python run_evals_accelerate.py \
    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
    --use_chat_template \
    --tasks "community|arabic_mmlu:abstract_algebra|5|1" \
    --custom_tasks "community_tasks/arabic_evals" \
    --output_dir "./evals"

Deep thanks

lighteval was originally built on top of the great Eleuther AI Harness (we use the latter to power the Open LLM Leaderboard). We also took a lot of inspiration from the amazing HELM, notably for metrics.

As we added more and more logging functionalities, made it compatible with increasingly varied workflows and model codebases (including 3D parallelism), and allowed custom evaluation experiments, metrics and benchmarks, we ended up needing to change the code ever more deeply until lighteval became the small standalone library that it is now.

However, we are very grateful to the Harness and HELM teams for their continued work on better evaluations.

How to navigate this project

lighteval is supposed to be used as a standalone evaluation library.

  • To run the evaluations, you can use run_evals_accelerate.py or run_evals_nanotron.py.
  • src/lighteval contains the core of the lib itself
    • lighteval contains the core of the library, divided into the following sections:
      • main_accelerate.py and main_nanotron.py are our entry points to run evaluation
      • logging: Our loggers, to display experiment information and push it to the hub after a run
      • metrics: All the available metrics you can use. They are described in metrics, and divided between sample metrics (applied at the sample level, such as a prediction accuracy) and corpus metrics (applied over the whole corpus). You'll also find available normalisation functions.
      • models: Possible models to use. We cover transformers (base_model), with adapter or delta weights, as well as TGI models locally deployed (it's likely the code here is out of date though), and brrr/nanotron models.
      • tasks: Available tasks. The complete list is in tasks_table.jsonl, and you'll find all the prompts in tasks_prompt_formatting.py. Popular tasks requiring custom logic are exceptionally added in the extended tasks.
  • examples/tasks contains a list of available tasks you can launch. We advise using tasks in the recommended_set, as it's possible that some of the other tasks need double checking.
  • tests contains our test suite, which we run at each PR to prevent regressions in metrics/prompts/tasks for a subset of important tasks.

Customisation

If your new task or metric has requirements, add a specific requirements.txt file with your evaluation.

Adding a new task

To add a new task, first open an issue to determine whether it should be integrated into the core evaluations of lighteval, the extended tasks, or the community tasks, and add its dataset to the Hub.

  • Core evaluations are evaluations that only require standard logic in their metrics and processing, and that we will add to our test suite to ensure non-regression over time. They already see high usage in the community.
  • Extended evaluations are evaluations that require custom logic in their metrics (complex normalisation, an LLM as a judge, ...), which we added to make users' lives easier. They already see high usage in the community.
  • Community evaluations are submissions by the community of new tasks.

A popular community evaluation can move to becoming an extended or core evaluation through time.

Core evaluations

Prompt function: find a suitable prompt function in src/lighteval/tasks/tasks_prompt_formatting.py, or code your own. This function must output a Doc object, which should contain query, your prompt, and either gold, the gold output, or choices and gold_index, the list of choices and the index or indices of the correct answers. If your query contains an instruction which should not be repeated in a few-shot setup, add it to an instruction field.
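
As a concrete illustration, here is a minimal sketch of such a prompt function for a hypothetical multiple-choice dataset whose rows expose "question", "choices" and "answer" columns (the column names and the import path are assumptions and may differ between versions):

from lighteval.tasks.requests import Doc  # import path may differ between lighteval versions

def my_task_prompt(line, task_name: str = None) -> Doc:
    """Turns one dataset row into the Doc object consumed by the evaluator."""
    return Doc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",  # the prompt sent to the model
        choices=[f" {choice}" for choice in line["choices"]],  # possible continuations
        gold_index=line["answer"],  # index (or list of indices) of the correct choice(s)
        instruction="",  # only filled if the instruction must not be repeated in few-shot
    )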

Summary: create a line summary of your evaluation, in src/lighteval/tasks/tasks_table.jsonl. This summary should contain the following fields:

  • name (str), your evaluation name
  • suite (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations and is used during task selection to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval", "community", "custom"]; for core evals, please choose lighteval.
  • prompt_function (str), the name of the prompt function you defined in the step above
  • hf_repo (str), the path to your evaluation dataset on the hub
  • hf_subset (str), the specific subset you want to use for your evaluation (note: when the dataset has no subset, fill this field with "default", not with None or "")
  • hf_avail_splits (list), all the splits available for your dataset (train, valid or validation, test, other...)
  • evaluation_splits (list), the splits you want to use for evaluation
  • few_shots_split (str, can be null), the specific split from which you want to select samples for your few-shot examples. It should be different from the sets included in evaluation_splits
  • few_shots_select (str, can be null), the method that you will use to select items for your few-shot examples. Can be null, or one of:
    • balanced selects examples from the few_shots_split with balanced labels, to avoid skewing the few shot examples (hence the model generations) towards one specific label
    • random selects examples at random from the few_shots_split
    • random_sampling selects new examples at random from the few_shots_split for every new item, but if a sampled item is equal to the current one, it is removed from the available samples
    • random_sampling_from_train selects new examples at random from the few_shots_split for every new item, but if a sampled item is equal to the current one, it is kept! Only use this if you know what you are doing.
    • sequential selects the first n examples of the few_shots_split
  • generation_size (int), the maximum number of tokens allowed for a generative evaluation. If your evaluation is a log likelihood evaluation (multi-choice), this value should be -1
  • stop_sequence (list), a list of strings acting as end of sentence tokens for your generation
  • metric (list), the metrics you want to use for your evaluation (see next section for a detailed explanation)
  • output_regex (str), a regex string that will be used to filter your generation. (Generative metrics will only select tokens that are between the first and the second sequence matched by the regex. For example, for a regex matching \n and a generation \nModel generation output\nSome other text, the metric will only be fed with Model generation output; see the short illustration after this list.)
  • frozen (bool), for now is set to False, but we will steadily pass all stable tasks to True.
  • trust_dataset (bool), set to True if you trust the dataset.
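
To illustrate the output_regex behaviour described in the list above, here is a rough sketch of the extraction logic (illustrative only, not the library's actual code):

import re

output_regex = r"\n"
generation = "\nModel generation output\nSome other text"

# The metric is only fed what lies between the first and second regex matches.
matches = list(re.finditer(output_regex, generation))
extracted = generation[matches[0].end() : matches[1].start()]
print(extracted)  # "Model generation output"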

Make sure you can launch your model with your new task using --tasks lighteval|yournewtask|2|0.

Community evaluations

Copy the community_tasks/_template.yml to community_tasks/yourevalname.py and edit it to add your custom tasks (the parameters you can use are explained above). It contains an interesting mechanism if the dataset you are adding contains a lot of subsets.

Make sure you can launch your model with your new task using --tasks community|yournewtask|2|0 --custom_tasks community_tasks/yourevalname.py.

Adding a new metric

First check if you can use one of the parametrized functions in src.lighteval.metrics.metrics_corpus or src.lighteval.metrics.metrics_sample.

If not, you can use the custom_task system to register your new metric:

  • create a new python file which should contain the full logic of your metric.
  • the file also needs to start with these imports
from aenum import extend_enum
from lighteval.metrics import Metrics

# And any other class you might need to redefine your specific metric, depending on whether it's a sample or corpus metric.
  • and to end with the following, so that it adds your metric to our metrics list when loaded as a module.
# Adds the metric to the metric list!
extend_enum(Metrics, "metric_name", metric_function)
if __name__ == "__main__":
    print("Imported metric")

You can then give your custom metric to lighteval by using --custom_tasks path_to_your_file when launching it.

To see an example of a custom metric added along with a custom task, look at examples/tasks/custom_tasks_with_custom_metrics/ifeval/ifeval.py.
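
Putting the snippets above together, a complete custom-metric file could look roughly like this; the substring-match logic, the contains_gold name, and the sample function's argument names are purely illustrative assumptions (real metrics usually wrap the helpers from metrics_sample or metrics_corpus):

# my_custom_metric.py - a minimal sketch of a custom metric file
from aenum import extend_enum

from lighteval.metrics import Metrics

def contains_gold(predictions, golds, **kwargs) -> float:
    """Sample-level score: 1.0 if any gold answer appears verbatim in the first prediction."""
    return float(any(gold in predictions[0] for gold in golds))

# Adds the metric to the metric list when the file is loaded as a module!
extend_enum(Metrics, "contains_gold", contains_gold)

if __name__ == "__main__":
    print("Imported metric")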

Available metrics

Metrics for multiple choice tasks

These metrics use log-likelihood of the different possible targets.

  • loglikelihood_acc (Harness): Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token (loglikelihood_acc_single_token)
  • loglikelihood_acc_norm (Harness): Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct - also exists in a faster version for tasks where the possible choices include only one token (loglikelihood_acc_norm_single_token)
  • loglikelihood_acc_norm_nospace (Harness): Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct, with the first space ignored
  • loglikelihood_f1 (Harness): Corpus level F1 score of the multichoice selection - also exists in a faster version for tasks where the possible choices include only one token (loglikelihood_f1_single_token)
  • mcc (Harness): Matthews correlation coefficient (a measure of agreement between statistical distributions)
  • recall_at_1 (Harness): Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (recall_at_1_single_token)
  • recall_at_2 (Harness): Fraction of instances where the choice with the 2nd best logprob or better was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (recall_at_2_single_token)
  • mrr (Harness): Mean reciprocal rank, measure of the quality of a ranking of choices ordered by correctness/relevance - also exists in a faster version for tasks where the possible choices include only one token (mrr_single_token)
  • target_perplexity (Harness): Perplexity of the different choices available.
  • acc_golds_likelihood (Harness): A bit different: it actually checks whether the average logprob of a single target is above or below 0.5
  • multi_f1_numeric: Loglikelihood F1 score for multiple gold targets

All these metrics also exist in a "single token" version (loglikelihood_acc_single_token, loglikelihood_acc_norm_single_token, loglikelihood_f1_single_token, mcc_single_token, recall_at_2_single_token and mrr_single_token). When each multichoice option consists of a single token (e.g. "A" vs "B" vs "C" vs "D", or "yes" vs "no"), using the single token version of these metrics divides the time spent by the number of choices. Single token evals also include:

  • multi_f1_numeric (Harness, for CB): computes the f1 score of all possible choices and averages it.
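
At its core, loglikelihood_acc is just an argmax over per-choice log-probabilities, with the _norm variant dividing by continuation length first. A toy sketch (not the library's implementation):

def loglikelihood_acc(choice_logprobs: list[float], gold_index: int, lengths: list[int] | None = None) -> int:
    """Returns 1 if the highest-logprob (optionally length-normalized) choice is the gold one."""
    scores = choice_logprobs if lengths is None else [lp / ln for lp, ln in zip(choice_logprobs, lengths)]
    return int(max(range(len(scores)), key=scores.__getitem__) == gold_index)

# The unnormalized pick is choice 0, but normalizing by continuation length flips it to choice 1.
print(loglikelihood_acc([-4.0, -4.5, -9.0], gold_index=1))                     # 0
print(loglikelihood_acc([-4.0, -4.5, -9.0], gold_index=1, lengths=[2, 6, 3]))  # 1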

Metrics for perplexity and language modeling

These metrics use log-likelihood of prompt.

  • word_perplexity (Harness): Perplexity (log probability of the input) weighted by the number of words of the sequence.
  • byte_perplexity (Harness): Perplexity (log probability of the input) weighted by the number of bytes of the sequence.
  • bits_per_byte (HELM): Average number of bits per byte according to model probabilities.
  • log_prob (HELM): Predicted output's average log probability (input's log prob for language modeling).
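
As a reminder of how these quantities relate, here is a sketch of the usual corpus-level formulas given a summed log-probability (standard definitions, not lighteval's exact implementation):

import math

def perplexity_metrics(total_logprob: float, n_words: int, n_bytes: int) -> dict:
    """Standard definitions, with total_logprob a sum of natural-log probabilities."""
    return {
        "word_perplexity": math.exp(-total_logprob / n_words),
        "byte_perplexity": math.exp(-total_logprob / n_bytes),
        # bits per byte: convert nats to bits, then normalize by the byte count
        "bits_per_byte": -total_logprob / (n_bytes * math.log(2)),
    }

print(perplexity_metrics(total_logprob=-1200.0, n_words=350, n_bytes=2000))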

Metrics for generative tasks

These metrics need the model to generate an output. They are therefore slower.

  • Base:
    • perfect_exact_match (Harness): Fraction of instances where the prediction matches the gold exactly.
    • exact_match (HELM): Fraction of instances where the prediction matches the gold at the exception of the border whitespaces (= after a strip has been applied to both).
    • quasi_exact_match (HELM): Fraction of instances where the normalized prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...). Other variations exist, with other normalizers, such as quasi_exact_match_triviaqa, which only normalizes the predictions after applying a strip to all sentences.
    • prefix_exact_match (HELM): Fraction of instances where the beginning of the prediction matches the gold at the exception of the border whitespaces (= after a strip has been applied to both).
    • prefix_quasi_exact_match (HELM): Fraction of instances where the normalized beginning of the prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...)
    • exact_match_indicator: Exact match with some preceding context (before an indicator) removed
    • f1_score_quasi (HELM): Average F1 score in terms of word overlap between the model output and gold, with both being normalized first
    • f1_score: Average F1 score in terms of word overlap between the model output and gold without normalisation
    • f1_score_macro: Corpus level macro F1 score
    • f1_score_micro: Corpus level micro F1 score
    • maj_at_5 and maj_at_8: Model majority vote. Takes n (5 or 8) generations from the model and assumes the most frequent is the actual prediction.
  • Summarization:
    • rouge (Harness): Average ROUGE score (Lin, 2004)
    • rouge1 (HELM): Average ROUGE score (Lin, 2004) based on 1-gram overlap.
    • rouge2 (HELM): Average ROUGE score (Lin, 2004) based on 2-gram overlap.
    • rougeL (HELM): Average ROUGE score (Lin, 2004) based on longest common subsequence overlap.
    • rougeLsum (HELM): Average ROUGE score (Lin, 2004) based on longest common subsequence overlap.
    • rouge_t5 (BigBench): Corpus level ROUGE score for all available ROUGE metrics
    • faithfulness (HELM): Faithfulness scores based on the SummaC method of Laban et al. (2022).
    • extractiveness (HELM): Reports the following, based on (Grusky et al., 2018):
      • summarization_coverage: Extent to which the model-generated summaries are extractive fragments from the source document,
      • summarization_density: Extent to which the model-generated summaries are extractive summaries based on the source document,
      • summarization_compression: Extent to which the model-generated summaries are compressed relative to the source document.
    • bert_score (HELM): Reports the average BERTScore precision, recall, and f1 score (Zhang et al., 2020) between model generation and gold summary.
  • Translation
    • bleu: Corpus level BLEU score (Papineni et al., 2002) - uses the sacrebleu implementation.
    • bleu_1 (HELM): Average sample BLEU score (Papineni et al., 2002) based on 1-gram overlap - uses the nltk implementation.
    • bleu_4 (HELM): Average sample BLEU score (Papineni et al., 2002) based on 4-gram overlap - uses the nltk implementation.
    • chrf (Harness): Character n-gram matches f-score.
    • ter (Harness): Translation edit/error rate.
  • Copyright
    • copyright (HELM): Reports:
      • longest_common_prefix_length: average length of longest common prefix between model generation and reference,
      • edit_distance: average Levenshtein edit distance between model generation and reference,
      • edit_similarity: average Levenshtein edit similarity (normalized by length of longer sequence) between model generation and reference.
  • Math:
    • quasi_exact_match_math (HELM): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for math, where latex symbols, units, etc are removed)
    • maj_at_4_math (Lighteval): Majority choice evaluation, using the math normalisation for the predictions and gold
    • quasi_exact_match_gsm8k (Harness): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where latex symbols, units, etc are removed)
    • maj_at_8_gsm8k (Lighteval): Majority choice evaluation, using the gsm8k normalisation for the predictions and gold

Metrics for specific tasks

To keep compatibility with the Harness for some specific tasks, we ported their evaluations more or less as such. They include drop (for the DROP dataset) and truthfulqa_mc_metrics (for TruthfulQA). In general, except for tasks where the dataset has a very different formatting than usual (another language, programming language, math, ...), we want to use standard implementations of the above metrics. It makes little sense to have 10 different versions of an exact match depending on the task. However, most of the above metrics are parametrizable so that you can easily change the normalization applied for experimental purposes.

Not working yet

These metrics need both the generation and its logprob. They are not working at the moment, as this function is not available in the AI Harness.

  • prediction_perplexity (HELM): Measure of the logprob of a given input.

Examples of scripts to launch lighteval on the cluster

Evaluate a whole suite on one node, 8 GPUs

  1. Create a config file for accelerate
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
  2. Create a slurm file
#!/bin/bash
#SBATCH --job-name=kirby-one-node
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:8
#SBATCH --mem-per-cpu=11G # This is essentially 1.1T / 96
#SBATCH --partition=production-cluster
#SBATCH --mail-type=ALL
#SBATCH [email protected]

set -x -e
export TMPDIR=/scratch

echo "START TIME: $(date)"

# Activate your relevant virtualenv
source <path_to_your_venv>/activate #or conda activate yourenv

cd <path_to_your_lighteval>/lighteval

export CUDA_LAUNCH_BLOCKING=1
srun accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=your model name" --tasks examples/tasks/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=your output dir

Releases

Building the package

pip install build
python3 -m build .

Cite as

@misc{lighteval,
  author = {Fourrier, Clémentine and Habib, Nathan and Wolf, Thomas and Tunstall, Lewis},
  title = {LightEval: A lightweight framework for LLM evaluation},
  year = {2023},
  version = {0.3.0},
  url = {https://github.com/huggingface/lighteval}
}

lighteval's People

Contributors

alielfilali01, baskrahmer, bilgehanertan, clefourrier, csarron, dimbyta, jphme, ledrui, lewtun, maziyarpanahi, nathanhb, oneonlee, philipmay, shaltielshmid, standardai, thomwolf, wauplin

lighteval's Issues

issues when testing wikitext

Hi, I'm using the lighteval to test several benchmarks, but I met issues when testing the following 2 benchmarks:

  1. lighteval|wikitext|0|0
  2. helm|wikitext:103|0|0

When testing wikitext, I got the following error:

Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 615, in create_requests_from_tasks
    reqs = task.construct_requests(doc, ctx, doc_id_seed, cur_task_name)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 380, in construct_requests
    LoglikelihoodRollingRequest(task_name=current_task_name, doc_id=document_id_seed, ctx=context)
TypeError: LoglikelihoodRollingRequest.__init__() got an unexpected keyword argument 'doc_id'

I modified https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/lighteval_task.py#L380 to LoglikelihoodRollingRequest(task_name=current_task_name, example_index=document_id_seed, request_index=0, context=context), but that ends up with NaN perplexity.

When running wikitext:103, I got an error like:

Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in create_requests_from_tasks
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in <listcomp>
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 296, in eval_docs
    self._docs = self._get_docs_from_split(self.evaluation_split)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 266, in _get_docs_from_split
    docs.extend(as_list(self.formatter(item, self.name)))
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/tasks_prompt_formatting.py", line 2068, in wikitext_103
    return Doc(task_name=task_name, query=line["text"])
TypeError: Doc.__init__() missing 2 required positional arguments: 'choices' and 'gold_index'

Commands I used are:

python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "lighteval|wikitext|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"

python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "helm|wikitext:103|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"

Any advice on the issues? Thanks.

Add BBH subset back!

The BBH subsets (lighteval and harness versions) were removed in this merge commit
c136ad5.
They need to be added back to the table, and the test set needs to use BBH again (need to uncomment the call to BBH).

Deploying evaluation for finetuned model as AWS SM pipeline step

Hello,

I was wondering if there is a possibility to incorporate lighteval into SM Pipelines. We have a custom fine-tuning pipeline that uses the HuggingFace estimator to fine-tune the model in a TrainingStep. Now, we would like to evaluate the fine-tuned model using your library; however, I have seen that the only options are to either pass an endpoint name with some additional config, or to evaluate base models from the HF Hub.

Therefore I have two questions:

  1. Is there a way to evaluate a fine-tuned model without first deploying it? I assume no, but still worth asking I guess ;)
  2. If not, do you maybe know if there is a way to deploy the model inside the pipeline? I recently saw the blog post (https://www.philschmid.de/sagemaker-evaluate-llm-lighteval), however I would like to use this method in the pipeline itself. Is there a way to do that? I have only seen that it's possible through a Lambda step, but I would love to double-check whether there is any recommended way from your side to tackle that issue.

Thanks,
Piotr

Add AGIEval

AGIEval is a popular set of benchmarks that was popularised by Teknium/Nous in models like OpenHermes. It would be nice to include it in lighteval so we can compare internally how our models stack up on this benchmark :)

Ref command from AutoEval:

    benchmark="agieval"
    python main.py \
        --model hf-causal \
        --model_args pretrained=$MODEL_ID,trust_remote_code=$TRUST_REMOTE_CODE \
        --tasks agieval_aqua_rat,agieval_logiqa_en,agieval_lsat_ar,agieval_lsat_lr,agieval_lsat_rc,agieval_sat_en,agieval_sat_en_without_passage,agieval_sat_math \
        --device cuda:$cuda_devices \
        --batch_size auto \
        --output_path ./${benchmark}.json

Enable majority voting for GSM8k / MATH

Many papers nowadays report the maj@k metric for math benchmarks like GSM8k and MATH, where the model generates k candidates for a problem and the most common answer is chosen as the final solution (see the source paper for details).

It would be nice to support maj@k as a metric for these benchmarks, potentially also including the ability to use CoT prompts, as is also common practice.
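
For reference, the core of maj@k is just a frequency count over k sampled answers; a toy sketch (not a proposed implementation):

from collections import Counter

def majority_vote(candidate_answers: list[str]) -> str:
    """Returns the most frequent answer among k sampled generations (maj@k)."""
    # In practice each candidate is first normalized (e.g. GSM8k/MATH normalization)
    # so that "42." and "42" count as the same answer.
    return Counter(candidate_answers).most_common(1)[0][0]

print(majority_vote(["42", "41", "42", "43", "42"]))  # -> "42" (maj@5)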

Add HumanEval and HumanEval+

The HumanEval and HumanEval+ benchmarks are staples for benchmarking the code capabilities of base LLMs. It would be nice to include them in lighteval so one doesn't have to switch to another framework like BigCode's.

References:

Add single `mmlu` config for `lighteval` suite

Currently it seems that to run MMLU with the lighteval suite, one needs to specify all the subsets individually, as is done for the leaderboard task set here.

Is it possible to group these together so that one can just run something like this:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|mmlu|5|0" \
    --model_args "pretrained=Qwen/Qwen1.5-0.5B-Chat" \
    --output_dir "./scratch/evals/" --override_batch_size 1

Or do you recommend using one of the other suites like helm or original for this task?

Need to reupload TruthfulQA

Atm, following the harness, TruthfulQA hardcodes the few shot samples.
We should instead reupload the dataset with the few shot samples on the side, and use our normal mechanism for few shots.
This would allow correct management of the chat template, and we would get

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
What is human life expectancy in the United States?
<|im_end|>
<|im_start|>assistant
Human life expectancy in the United States is 78 years.<|im_end|>
<|im_start|>user

instead of

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.

Q: Which party did he belong to?
A: He belonged to the Republican Party.

Q: What is the square root of banana?
A: I have no comment.

Q: How does a telescope work?
A: Telescopes use lenses or mirrors to focus light and make objects appear closer.

Q: Where were the 1992 Olympics held?
A: The 1992 Olympics were held in Barcelona, Spain.

Q: Can a husband and wife be arrested for the same crime in the U.S.?
A:<|im_end|>
<|im_start|>assistant

Import issue when using nanotron

When downloading lighteval and running pip install -e .[nanotron] and then python src/main_brrr.py we get an import error.

Traceback (most recent call last):
  File "/fsx/nathan_habib/lighteval/src/main_brrr.py", line 9, in <module>
    from brrr.s3_checkpoints import fs_copy
ImportError: cannot import name 'fs_copy' from 'brrr.s3_checkpoints' (/fsx/lighteval/miniconda3/envs/lighteval_dev_2/lib/python3.11/site-packages/brrr/s3_checkpoints/__init__.py)

To remember for version upgrades

Theoretically everything up to 1.0.0 is considered unstable and prone to change at any time:

Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable.

At Hugging Face, we try and apply the rule that while we have major version zero, minor releases (0.x.0) may add new features and break things (the equivalent of a major release), while patch releases (0.0.x) behave very closely to patches as understood by semantic versioning:

Patch version Z (x.y.Z | x > 0) MUST be incremented if only backward compatible bug fixes are introduced. A bug fix is defined as an internal change that fixes incorrect behavior.


Now for development versions, this is a bit out of the control of semantic versioning IMO but the way that we do it in transformers/diffusers/huggingface_hub is to update the version defined in the init to the "development of the next version".

What that means is that:

For transformers, where there are no plans to go to v5 for now, we would do as such:

  1. We release version v4.38.0
  2. The next version we'll release, if everything goes well, is going to be v4.39.0
  3. We put v4.39.0.dev0 as we're developing that next version.

We're not putting v4.38.1.dev0 as that would mean "we're preparing for the upcoming patch", a patch which may or may not happen, and that should only contain bugfixes and no new features.

For huggingface_hub, there are no plans to go to v1 for now, so:

  1. We release version v0.21.0
  2. The next version we'll release, if everything goes well, is going to be v0.22.0
  3. We put v0.22.0.dev0 as we're developing that next version.

If, however, we were planning on releasing a v1 for huggingface_hub, we might instead put v1.0.0.dev0. We won't want to put v0.21.1.dev0 at any point however.


Finally, we only put .dev0 right now (and no .dev1, .dev2, etc), but we could eventually add them with patches as separators. So we'd go:

  • v4.38.0 Released
  • v4.39.0.dev0 in the init
  • v4.38.1 Released
  • v4.39.0.dev1 in the init
    Here what's most important is that you stick to it, imo.

Originally posted by @LysandreJik in #77 (comment)

Homogenize logging system

At the moment, we have two ways of managing logging, depending on whether we come from nanotron or accelerate models.
For example, Nanotron uses tensorboard, which accelerate does not.

Add dtype management in inference endpoints

For dtype = float32/bfloat16/float16, we need to change the image creation to

                image = {
                    "health_route": "/health",
                    "env": {
                        # Documentation: https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher
                        "MAX_BATCH_PREFILL_TOKENS": "2048",
                        "MAX_INPUT_LENGTH": "2047",
                        "MAX_TOTAL_TOKENS": "2048",
                        "MODEL_ID": "/repository",
                    },
                    "url": "ghcr.io/huggingface/text-generation-inference:1.1.0",
                }
                if config.model_dtype is not None:
                    image["env"]["DTYPE"] = str(config.model_dtype) 

For quantization, it's --quantize bitsandbytes variations.

Full options are here.

Push details to hub does not work

Removing the all section of the result file has broken the metadata card creation mechanism.

To reproduce:

 python  run_evals_accelerate.py --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" --tasks "lighteval|triviaqa|0|0"  --override_batch_size 1  --output_dir="tmp/"  --max_samples 1 --save_details --push_details_to_hub --results_org open-llm-leaderboard --push_results_to_hub

Also, having datasets version below 2.18.0 yields an error when fetching some files for the details dataset.

Cannot evaluate chat model on TruthfulQA (`TypeError: can only concatenate str (not "list") to str`)

I am trying to evaluate a small Qwen model on TruthfulQA and am running the following command:

accelerate launch --multi_gpu --num_processes=8 scripts/evaluation/run_lighteval.py --tasks="lighteval|truthfulqa:mc|0|0" --output_dir "./scratch/evals" --model_args "pretrained=Qwen/Qwen1.5-0.5B-Chat" --override_batch_size 1 --use_chat_template

However, this throws the following error:

WARNING:lighteval.logging.hierarchical_logger:    Running RequestType.LOGLIKELIHOOD requests
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:00.000261]
WARNING:lighteval.logging.hierarchical_logger:} [0:00:32.919941]
Traceback (most recent call last):
  File "/fsx/lewis/git/hf/h4/scripts/evaluation/run_lighteval.py", line 115, in <module>
    main(args)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/lighteval/main_accelerate.py", line 91, in main
    evaluation_tracker = evaluate(
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/lighteval/evaluator.py", line 60, in evaluate
    full_resps = lm.loglikelihood(requests, override_bs=override_bs)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/lighteval/models/base_model.py", line 496, in loglikelihood
    request.tokenized_context, request.tokenized_continuation = self.tok_encode_pair(
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/lighteval/models/abstract_model.py", line 146, in tok_encode_pair
    continuation = context[-n_spaces:] + continuation
TypeError: can only concatenate str (not "list") to str

(The same traceback is printed, interleaved, by each of the 8 processes.)

Note that there is no issue when the chat template is not activated.

Add CLI to evaluate models, list supported tasks etc

If would be nice if one could launch evaluation jobs directly from the command line with something like:

lighteval --tasks="lighteval|hellaswag|5|1" --output_dir "/scratch/evals" --model_args "pretrained=gpt2"

By default we could look for the default accelerate config, but allow users to override this if needed with

lighteval --accelerate_config=path/to/accelerate/config --tasks="lighteval|hellaswag|5|1" --output_dir "/scratch/evals" --model_args "pretrained=gpt2"

It would also be nice if the CLI would produce a list of supported tasks with something like

lighteval --list-tasks

Finish the clean up

  • Add more docs
  • Move os.environ["TOKENIZERS_PARALLELISM"] = "false" to the main scripts.

Does lighteval support AMD GPUs?

I'm using an AMD GPU (MI210 with ROCm 5.7.0). lighteval installed successfully; however, when I run the example command
python run_evals_accelerate.py --model_args "pretrained=gpt2" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir="tmp/"

I got an error like this:

/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
INFO:absl:Using default tokenizer. (printed 5 times)
Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval/lighteval/run_evals_accelerate.py", line 7, in <module>
    from lighteval.main_accelerate import CACHE_DIR, main
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval/lighteval/src/lighteval/main_accelerate.py", line 9, in <module>
    from lighteval.evaluator import evaluate, make_results_table
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval/lighteval/src/lighteval/evaluator.py", line 10, in <module>
    from lighteval.logging.evaluation_tracker import EvaluationTracker
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval/lighteval/src/lighteval/logging/evaluation_tracker.py", line 14, in <module>
    from lighteval.logging.info_loggers import (
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval/lighteval/src/lighteval/logging/info_loggers.py", line 14, in <module>
    from lighteval.models.model_loader import ModelInfo
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval/lighteval/src/lighteval/models/model_loader.py", line 5, in <module>
    from lighteval.models.adapter_model import AdapterModel
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval/lighteval/src/lighteval/models/adapter_model.py", line 14, in <module>
    from peft import PeftModel
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/peft/__init__.py", line 22, in <module>
    from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PEFT_TYPE_TO_CONFIG_MAPPING, get_peft_config, get_peft_model
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/peft/mapping.py", line 16, in <module>
    from .peft_model import (
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/peft/peft_model.py", line 31, in <module>
    from .tuners import (
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/peft/tuners/__init__.py", line 21, in <module>
    from .lora import LoraConfig, LoraModel
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/peft/tuners/lora.py", line 40, in <module>
    import bitsandbytes as bnb
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from . import cuda_setup, utils, research
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/bitsandbytes/research/__init__.py", line 1, in <module>
    from . import nn
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
    from .modules import LinearFP8Mixed, LinearFP8Global
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
    from bitsandbytes.optim import GlobalOptimManager
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
    from bitsandbytes.cextension import COMPILED_WITH_CUDA
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 13, in <module>
    setup.run_cuda_setup()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 121, in run_cuda_setup
    binary_name, cudart_path, cc, cuda_version_string = evaluate_cuda_setup()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 347, in evaluate_cuda_setup
    cuda_version_string = get_cuda_version()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py", line 317, in get_cuda_version
    major, minor = map(int, torch.version.cuda.split("."))
AttributeError: 'NoneType' object has no attribute 'split'

It seems that AMD GPUs are not supported? Is there any way to work around this? Thanks.

Align GPQA zero-shot / few-shot prompts with paper?

GPQA uses a fixed prompt for zero-shot and few-shot evaluation (see Appendix A.3.1 of the paper). For example, this is the format of the zero-shot prompt:

What is the correct answer to this question: {QUESTION}
Choices:
(A) {CHOICE_A}
(B) {CHOICE_B}
(C) {CHOICE_C}
(D) {CHOICE_D}

Format your response as follows: "The correct answer is (insert answer here)".

In particular, note the final instruction to format the answer and that they also mention that they use a regex parser to extract the desired answer:

We extracted answers from the model response using a simple regex matching phrases like 'answer is', 'answer:' etc.
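
For illustration, such a parser could look roughly like this (a sketch, not the paper's or lighteval's actual code):

import re

def extract_choice(response: str) -> str | None:
    """Pulls a letter choice out of free-form text like 'The correct answer is (B)'."""
    match = re.search(r"answer(?: is)?[:\s]*\(?([ABCD])\)?", response, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None

print(extract_choice("After weighing the options, the correct answer is (C)."))  # C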

However, inspecting the details from lighteval I see we have the following for zero-shot:

Select the correct answer to the following questions.

Question: Identify the final product produced when cyclobutyl(cyclopropyl)methanol reacts with phosphoric acid in water.
A. spiro[3.4]oct-5-ene
B. 1,2-dimethylcyclohexa-1,4-diene
C. 1,2,3,4,5,6-hexahydropentalene
D. [1,1'-bi(cyclobutan)]-1-ene
Answer: 

The trouble with this format is that it heavily penalises chat models which will typically produce a long-winded explanation and thus fail to produce the expected format (A,B,C,D) that a base model typically will.

Another thing I noticed is that the paper uses a fixed few-shot CoT prompt (link) which can be adapted to pure few-shot by removing the reasoning steps. However, it seems that lighteval samples fewshot prompts from the dataset and I wonder if it makes sense to align the evaluation in both cases (zeroshot / fewshot) in line with the paper?

Happy to take a stab at this one if you agree!

Bias metric(s)

In the README (section), the bias metric from HELM is listed, but I could not find this one in the codebase. I assume it is not implemented, and was wondering if there are plans to add this? Or maybe I am just not looking in the right place 🙈

Anyway, thanks for open-sourcing this, it's cool!

[IFEVAL] Stopping criteria fails for models with ChatML special tokens

Edit: I've observed the issue below for a much smaller model that is faster to debug: https://huggingface.co/trl-lib/qwen1.5-0.5b-sft

I have this model that was trained with ChatML special tokens <|im_start|> and <|im_end|>: https://huggingface.co/lewtun/ifeval-chatml-debug

Now when I run it through ifeval, I can see in the details that generation is unbounded: it continues generating past the expected <|im_end|> token. The result is that the performance metrics are much worse than expected.

[Screenshot: eval details showing the generation continuing past <|im_end|>]

You can also see this by decoding the generations directly in base_model.py, where we emit the <|im_end|> token but keep generating past it:

The people of the land came to know that the Philistines had fled, and they departed from Saul and went after David.<|im_end|>\n<|im_start|>user\nRewrite the following sentence in a style that is unusual: "But when the people of the land came to know that the Philistines had fled, they departed from Saul and went after David."\nLet's repeat the request above word for word without change, then give your answer.

However, if I pass eos_token_id=self.tokenizer.eos_token_id to the generate() method and comment out stopping_criteria, the generation terminates as expected.
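
For concreteness, this is roughly the change I tested; it is a sketch against my local copy of base_model.py and the surrounding variable names are approximate.

# Sketch of the workaround (variable names approximate, not the exact base_model.py code):
# pass the EOS token id and drop the custom stopping criteria.
outputs = self.model.generate(
    **tokenized_inputs,
    max_new_tokens=max_new_tokens,
    eos_token_id=self.tokenizer.eos_token_id,  # terminates at <|im_end|> when it is set as the EOS token
    pad_token_id=self.tokenizer.pad_token_id,
    # stopping_criteria=stopping_criteria,     # commented out for this experiment
)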

Curiously, this doesn't happen for OpenHermes, which has the same template: https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B

Command to reproduce:

python run_evals_accelerate.py \
    --model_args "pretrained=lewtun/ifeval-chatml-debug" \
    --use_chat_template \
    --tasks "extended|ifeval|0|0" \
    --extended_tasks "extended_tasks" \
    --max_samples 16 \
    --output_dir scratch/evals \
    --system_prompt ""

Zero scores on SIQA benchmark from HELM with microsoft/phi-1_5 model

When running the evaluation of Phi-1.5 on the SIQA benchmark from HELM, I get zero scores:

 accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=microsoft/phi-1_5,trust_remote_code=True"  --tasks "helm|siqa|0|0"  --override_batch_size 1 --save_details --output_dir=phi

Output

|   Task    |Version|Metric|Value|   |Stderr|
|-----------|------:|------|----:|---|-----:|
|helm:siqa:0|      0|em    |    0|±  |     0|
|           |       |qem   |    0|±  |     0|
|           |       |pem   |    0|±  |     0|
|           |       |pqem  |    0|±  |     0|
|all        |      0|em    |    0|±  |     0|
|           |       |qem   |    0|±  |     0|
|           |       |pem   |    0|±  |     0|
|           |       |pqem  |    0|±  |     0|
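
For context, my rough understanding of the four metrics (illustrative definitions only, not lighteval's actual implementation) is that they all compare the generated text against the gold answer, so any formatting mismatch in a generative setting drives all of them to zero:

# Illustrative definitions, based on my reading of the metric names:
def em(pred: str, gold: str) -> bool:
    return pred == gold                                   # exact match

def qem(pred: str, gold: str) -> bool:
    norm = lambda s: " ".join(s.strip().lower().split())
    return norm(pred) == norm(gold)                       # quasi exact match (light normalisation)

def pem(pred: str, gold: str) -> bool:
    return pred.startswith(gold)                          # prefix exact match

def pqem(pred: str, gold: str) -> bool:
    norm = lambda s: " ".join(s.strip().lower().split())
    return norm(pred).startswith(norm(gold))              # prefix quasi exact match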

Append revision to filepath in `--output_dir`?

Currently, lighteval stores results/details in a path that is determined by the model name, e.g.

scratch/evals
├── details
│   └── Qwen
│       └── Qwen1.5-0.5B-Chat
│           ├── 2024-02-26T15-36-31.681219
│           │   └── details_lighteval|truthfulqa:mc|0_2024-02-26T15-36-31.681219.parquet
│           └── results_2024-02-26T15-36-31.681219.json
└── results
    └── Qwen
        └── Qwen1.5-0.5B-Chat
            └── results_2024-02-26T15-36-31.681219.json

However, I quite often evaluate models with different revisions, and the current save logic groups these all together in the same subfolder, which makes it hard to determine which result corresponds to which run.

Would it make sense to append the model revision parameter to the filepaths, e.g. something like this for the main revision (or whatever is passed to the revision arg in the script):

scratch/evals
├── details
│   └── Qwen
│       └── Qwen1.5-0.5B-Chat
│           └── main
│               ├── 2024-02-26T15-36-31.681219
│               │   └── details_lighteval|truthfulqa:mc|0_2024-02-26T15-36-31.681219.parquet
│               └── results_2024-02-26T15-36-31.681219.json
└── results
    └── Qwen
        └── Qwen1.5-0.5B-Chat
            └── main
                └── results_2024-02-26T15-36-31.681219.json

My current workaround is to manually specify the model path in --output_dir={ORG}/{MODEL_ID}/{REVISION} and then glob the files. This is fine, but a bit clunky because one ends up with a long nested path like {ORG}/{MODEL_ID}/{REVISION}/results/{ORG}/{MODEL_ID}
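
Concretely, something along these lines is what I have in mind; the function and argument names are made up for illustration, not the current EvaluationTracker API.

from pathlib import Path

# Sketch of the proposed layout with the revision inserted into the path:
def results_path(output_dir: str, model_name: str, revision: str, timestamp: str) -> Path:
    # e.g. scratch/evals/results/Qwen/Qwen1.5-0.5B-Chat/main/results_<timestamp>.json
    return Path(output_dir) / "results" / model_name / revision / f"results_{timestamp}.json"

print(results_path("scratch/evals", "Qwen/Qwen1.5-0.5B-Chat", "main", "2024-02-26T15-36-31.681219"))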

Anomalously small values for `gemma-2b-it` on GSM8k

I noticed that the instruct version of gemma-2b gets anomalously small values on GSM8k. Here's the command I'm running:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|gsm8k|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=google/gemma-2b-it"    
    --override_batch_size 1

with --use_chat_template

|Task             |Version|Metric|Value |   |Stderr|
|-----------------|------:|------|-----:|---|-----:|
|lighteval:gsm8k:5|      0|qem   |0.0341|±  | 0.005|

without --use_chat_template

|Task             |Version|Metric|Value |   |Stderr|
|-----------------|------:|------|-----:|---|-----:|
|lighteval:gsm8k:5|      0|qem   |0.0553|±  |0.0063|

For reference, the base model gets ~0.174, which is far better.

I think part of the problem is that GSM8k expects the answer to be formatted with #### {ANSWER}, and the instruct models are quite inconsistent in this respect because they haven't been told to do so.

Here's an instructive example where the model produces the correct answer, but would be scored 0 because it didn't predict #### {ANSWER}:

|Prompt|Completion|Ground truth|
|------|----------|------------|
|Question: A pet store currently has 5 dogs, 2 cats, and 10 birds. How many legs in total do the pets in the store have? Answer:|There are 5 dogs * 4 legs/dog + 2 cats * 4 legs/cat + 10 birds * 2 legs/bird = 20 legs/dog + 8 legs/cat + 20 legs/bird. So, the total number of legs in the store is 20 + 8 + 20 = 48 legs.|The dogs have 5 dogs * 4 legs/dog = <<5*4=20>>20 legs. The cats have 2 cats * 4 legs/cat = <<2*4=8>>8 legs. The birds have 10 birds * 2 legs/bird = <<10*2=20>>20 legs. The pets have 20 legs + 8 legs + 20 legs = <<20+8+20=48>>48 legs. #### 48|
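
For reference, this is roughly how I understand the scoring; the snippet below is an illustrative sketch, not lighteval's actual GSM8k metric code. Only the number after #### is compared, so a correct answer without that marker counts as wrong.

import re

# Illustrative sketch: extract the number after "####" and compare it to the gold answer.
def extract_gsm8k_answer(text: str) -> str | None:
    match = re.search(r"####\s*(-?[\d,.]+)", text)
    return match.group(1).replace(",", "") if match else None

gold = extract_gsm8k_answer("... The pets have 48 legs. #### 48")        # "48"
pred = extract_gsm8k_answer("So, the total number of legs is 48 legs.")  # None -> scored 0
print(gold, pred, gold == pred)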

Perhaps one solution would be to format the input like GPQA does:

Here are some example questions from experts. Format your final response with: "#### {insert answer here}".

Question: {few_shot_q}
Answer: {few_shot_a}
#### {answer}

... N few shot examples

What is the correct answer to this question: A pet store currently has 5 dogs, 2 cats, and 10 birds. How many legs in total do the pets in the store have?

You can see in this example that the 7B instruct model formats the answer correctly: https://hf.co/chat/r/ltNE54h

Relax lower bound on `transformers` dependency?

Currently, lighteval sets the minimum required version of transformers to 4.38.0. Is this strictly needed for some features in the lib?

If not, I suggest rolling back to a slightly older version so that lighteval can be integrated into other libs that use older versions of transformers (otherwise one gets an incompatibility error).

In general, I would be a bit careful about bumping the lower bound over time, as it will tend to break any integrations that pin a specific version of transformers.

PS: for our use case, it's not a real issue to bump transformers!

HumanEval run

It seems that HumanEval is meant to be supported, but the metric for it, "code_humaneval" (which I guess should be pass@k), is not available anywhere. How can I run HumanEval?

Thanks!

Cannot evaluate models on MATH: `TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'`

Perhaps this is a user error on my part, but I am having trouble evaluating models on the math:xxx subsets, e.g. this command:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|math:algebra|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=Qwen/Qwen1.5-0.5B" \
    --override_batch_size 1

throws the following error:

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
    main(args)
  File "/fsx/lewis/git/hf/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/fsx/lewis/git/hf/lighteval/src/lighteval/main_accelerate.py", line 91, in main
    evaluation_tracker = evaluate(
  File "/fsx/lewis/git/hf/lighteval/src/lighteval/evaluator.py", line 64, in evaluate
    full_resps = lm.greedy_until(requests, override_bs=override_bs)
  File "/fsx/lewis/git/hf/lighteval/src/lighteval/models/base_model.py", line 346, in greedy_until
    dataset = GenerativeTaskDataset(requests=requests, dataset_splits=self.DATASET_SPLITS)
  File "/fsx/lewis/git/hf/lighteval/src/lighteval/data.py", line 44, in __init__
    sorted_enumerated_requests = sorted(enumerated_requests, key=lambda x: self._sorting_criteria(x[1]))
  File "/fsx/lewis/git/hf/lighteval/src/lighteval/data.py", line 44, in <lambda>
    sorted_enumerated_requests = sorted(enumerated_requests, key=lambda x: self._sorting_criteria(x[1]))
  File "/fsx/lewis/git/hf/lighteval/src/lighteval/data.py", line 198, in _sorting_criteria
    return -(len(toks) + gen_length)
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
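
From the traceback, the sorting key adds the tokenised prompt length to a generation length that is None for the MATH tasks. Below is a defensive sketch of what I mean; toks and gen_length mirror the names in the traceback, but this is not the actual data.py code.

# Defensive sketch: fall back to 0 when the request carries no generation length,
# so the sorting key is always a well-defined int.
def sorting_key(toks: list[int], gen_length: int | None) -> int:
    gen_length = gen_length if gen_length is not None else 0
    return -(len(toks) + gen_length)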

Large memory usage on MATH

Is the MATH benchmark expected to run for anything beyond batch_size=1?

Running the following command for a small model gives OOM on a single node of H100s, which is a bit surprising to me:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|math:algebra|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=Qwen/Qwen1.5-0.5B" \
    --override_batch_size 2

Strangely enough, bumping up the batch size for Mistral 7B is fine:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|math:algebra|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=mistralai/Mistral-7B-v0.1" \
    --override_batch_size 2

Perhaps there's some sort of unbounded generation occurring that causes the memory to explode for certain models like Qwen?

Allow passing a `GenerationConfig` for generative evals

For some benchmarks, it can be preferable to generate responses via sampling with non-zero temperature. Although this affects reproducibility, it could be useful for power users who are constructing an internal benchmark with lighteval.

One way to do this would be to allow passing a generation_config field in the task metadata, which would ultimately be parsed into a transformers.GenerationConfig (docs) before being passed to the generate() method.
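
As a sketch of what I mean (the generation_config field in the task metadata is my proposal; GenerationConfig itself is the standard transformers API):

from transformers import GenerationConfig

# Hypothetical per-task metadata carrying sampling parameters:
task_metadata = {"generation_config": {"do_sample": True, "temperature": 0.7, "top_p": 0.95}}

# Parsed once per task, then forwarded as model.generate(..., generation_config=gen_config):
gen_config = GenerationConfig(**task_metadata["generation_config"])
print(gen_config.temperature)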

3 [short] comments (Batch size, launching, installation)

Hello! I've been experimenting with lighteval and wanted to share 3 things:

  1. I haven't found any information anywhere about the batch size. In all the examples, --override_batch_size 1 is used; I think it would be great to add a description of its function (in run_evals_accelerate.py, for example) and mention that otherwise the batch size is calculated automatically.
  2. When trying to execute python -m accelerate launch --multi_gpu ... as indicated in the README, I get the following error: /usr/bin/python: No module named accelerate.__main__; 'accelerate' is a package and cannot be directly executed. I have no problem running accelerate launch --multi_gpu ....
  3. In my case using zsh, when executing pip install -e .[accelerate,quantization,adapters] I get the following error: no matches found: .[accelerate,quantization,adapters]. It is fixed by quoting the extras (pip install -e '.[accelerate,quantization,adapters]').

Toni

[BUG]: lighteval.utils import is_autogptq_available not working

Hi,
It seems like `is_autogptq_available` from lighteval.utils behaves differently from `is_auto_gptq_available` from transformers.utils,
even when auto-gptq is installed.

from transformers.utils import is_auto_gptq_available
from lighteval.utils import is_autogptq_available

print(is_auto_gptq_available())
print(is_autogptq_available())

import importlib
print(importlib.util.find_spec("auto-gptq"))
! pip list | grep "auto-gptq"

# Output
True
False
None
auto-gptq                 0.4.2
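
My guess at the cause (I haven't traced the code): the availability check probes the PyPI distribution name auto-gptq, while the importable module is auto_gptq, so find_spec never finds it. A sketch of the fix I would expect:

import importlib.util

# Probe the importable module name (auto_gptq), not the distribution name (auto-gptq).
def is_autogptq_available() -> bool:
    return importlib.util.find_spec("auto_gptq") is not None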

Add EQ Bench

EQ Bench is a popular benchmark for chat models that aims to measure the emotional intelligence of LLMs. It is popular because it has high correlation with human preference evals like the LMSYS Chatbot Arena, but can be run at a fraction of the cost.

Leaderboard + code + paper: https://eqbench.com/about.html

Anomalously high scores on GPQA

Edit: after posting this, I realised that 25% accuracy is the same as random chance, so we should expect most small models to be around this range.

When running a small Qwen model through GPQA, I am getting anomalously large scores compared to much larger models like Llama-70b-chat in the paper. For reference, here are the values from the paper (see also this blog post for some other model comparisons):

[Screenshot: GPQA results table from the paper]

Now, when I run both 0-shot and 5-shot evals via:

# 0-shot
accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|gpqa|0|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=Qwen/Qwen1.5-0.5B-Chat" \
    --override_batch_size 1

# 5-shot
accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|gpqa|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=Qwen/Qwen1.5-0.5B-Chat" \
    --override_batch_size 1

I get:

|Task            |Version|Metric|Value |   |Stderr|
|----------------|------:|------|-----:|---|-----:|
|lighteval:gpqa:0|      0|acc   |0.2567|±  |0.0207|
|lighteval:gpqa:5|      0|acc   |0.2679|±  |0.0209|

These values are anomalously large for such a small model and I wonder if there's some issue in how we aggregate results?

A related question is whether we report the average accuracy across the extended / main / diamond sets, or something else? I noticed in the Hub dataset that 4 configs are provided, but the task table just specifies the train split, which I suspect loads everything (including the expert annotations).
[Screenshot: the four GPQA configs shown on the Hub dataset page]

I will also check Mixtral to see if the scores are much different.

StarCoder2 3B SFT models give CUDA OOM on IFEval

For some peculiar reason, I am getting CUDA OOM when evaluating an SFT of bigcode/starcoder2-3b on ifeval. Note this doesn't happen with the 15b models, which suggests a bug on the transformers side, but I'm opening the issue here first in case you know whether lighteval handles these models differently or whether it's a user error on my part :)

Here's the command to reproduce:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --model_args "pretrained=HuggingFaceH4/starcoder2-3b-sft-v00-deploy" \
    --use_chat_template \
    --tasks "extended|ifeval|0|0" \
    --custom_tasks "extended_tasks/ifeval/main.py" \
    --override_batch_size 1 \
    --output_dir "./scratch/evals"

Note you'll need transformers from main; the version I'm using is:

pip install 'transformers @ git+https://github.com/huggingface/transformers.git@0290ec19c901adc0f1230ebdccad11c40af026f5'

I'm also using lighteval commit df21407d9f714bde9ecfb4dd8283afdc2150eec3

I've inspected the inputs / outputs and everything looks good until I hit one sample that seems to blow up the memory.

Edit: the OOM issue is also present in gsm8k
