
alpaca_eval's People

Contributors

44670, actions-user, aligninc, c1rn09, gblazex, genezc, haniitani, hendrydong, hyperdrivehustle, imoneoi, inferllm, jdf-prog, jetrunner, jondurbin, kyleliang919, lxuechen, muennighoff, nbl97, reign12, rtaori, sanderland, tiiiger, victorsungo, vpeterv, winglian, xianxl, yanndubs, yuani114, yulinchen99, zfang

alpaca_eval's Issues

[style] fix ill-formatted logging message

Some of the log messages are single multi-line strings (for example this). These multi-line strings don't display nicely on the console because of the implicit indentation; they could be reformatted to use explicit newlines.
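A minimal sketch of the suggested reformatting (the message text and variable are made up for illustration):

import logging
import textwrap

n_missing = 3  # hypothetical value, just for the example

# An indented triple-quoted string carries the code's indentation into the log:
ugly = f"""Found {n_missing} missing annotations.
        They will be re-annotated on the next run."""

# Dedenting (or joining explicit newlines) keeps the console output aligned:
clean = textwrap.dedent(
    f"""\
    Found {n_missing} missing annotations.
    They will be re-annotated on the next run."""
)

logging.warning(ugly)   # second line shows up indented by the source indentation
logging.warning(clean)  # lines are flush with the left margin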

Post-analysis of the annotations

Since the two outputs are randomly ordered during evaluation, it is hard to conduct post-analysis of the evaluation results. Adding extra fields such as output_1_source, output_2_source, and data_source to the output JSON file would be great.
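A minimal sketch of what that could look like, assuming the annotation rows are plain dicts (the helper and its arguments are hypothetical):

def tag_sources(row: dict, source_1: str, source_2: str, data_source: str) -> dict:
    # Record where each output came from *before* the pair is randomly ordered,
    # so the saved annotations can be grouped and analysed afterwards.
    row["output_1_source"] = source_1
    row["output_2_source"] = source_2
    row["data_source"] = data_source
    return row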

[LOG] improve the logging for OpenAI maximum context length

I tried to run the chatgpt evaluator, but the OpenAI requester seems to go into an infinite retry loop:

WARNING:root:OpenAIError: This model's maximum context length is 4097 tokens. However, your messages resulted in 4452 tokens. Please reduce the length of the messages..
WARNING:root:Hit request rate limit; retrying...
(repeats forever)

Similar loops happen when trying GPT-4 if you do not have access to the model or use an invalid API key. These seem to be caused by treating any other error as a rate-limiting message.

I tried to fix the token limit loop by:

if "Please reduce your prompt" in str(e) or "This model's maximum context length" in str(e):

but it can still crash due to this line, probably because the combined length of the generated samples is too long to fit:

if kwargs["max_tokens"] == 0:
   raise e
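A sketch of the kind of error triage this suggests (illustrative only, matching on message strings as above rather than describing the package's actual exception handling):

def should_retry(e: Exception) -> bool:
    # Only transient rate-limit errors are worth retrying; context-length and
    # auth/model-access errors will never succeed on retry.
    msg = str(e)
    if "maximum context length" in msg or "Please reduce" in msg:
        return False  # shrink the prompt or max_tokens instead of retrying
    if "Rate limit" in msg or "rate_limit_exceeded" in msg:
        return True
    return False  # unknown errors (invalid key, no GPT-4 access) should surface


# inside the completion helper, roughly:
# except openai.error.OpenAIError as e:
#     if not should_retry(e):
#         raise
#     logging.warning("Hit request rate limit; retrying...")
#     time.sleep(sleep_time)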

GPT4 rate limit

Hi, I am trying to evaluate our model output using alpaca_eval.

Here is the command:

export OPENAI_API_KEY="sk-xxxxxx"
alpaca_eval --model_outputs 'output/alpaca_eval/outputs.json'

Problem: while running, I keep getting "Rate limit reached" errors.

INFO:openai:error_code=rate_limit_exceeded error_message='Rate limit reached for default-gpt-4 in organization org-0Ibh47ogWcbeM3DJsyr3EC29 on tokens per min. Limit: 40000 / min. Please try again in 1ms. Contact us through our help center at help.openai.com if you continue to have issues.' error_param=None error_type=tokens message='OpenAI API error received' stream_error=False
WARNING:root:OpenAIError: Rate limit reached for default-gpt-4 in organization org-0Ibh47ogWcbeM3DJsyr3EC29 on tokens per min. Limit: 40000 / min. Please try again in 1ms. Contact us through our help center at help.openai.com if you continue to have issues..
WARNING:root:Hit request rate limit; retrying...

Question:

  1. How could I control the GPT-4 access rate (e.g., add some delay in the code)? A throttling sketch is shown after these questions.
  2. After evaluation, I got n_total = 800. Does this mean that 5 tests failed? Are these failures related to the GPT-4 rate-limit error?
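On question 1, a minimal client-side throttle (purely illustrative, not part of the package) could look like this; it only spaces requests out so the server-side limit is hit less often:

import time


def throttled(fn, min_interval: float = 1.5):
    # Wrap an API call so consecutive requests are at least `min_interval`
    # seconds apart. The package already retries on rate-limit errors, so this
    # just reduces how often those retries are triggered.
    last_call = 0.0

    def wrapper(*args, **kwargs):
        nonlocal last_call
        wait = min_interval - (time.time() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.time()
        return fn(*args, **kwargs)

    return wrapper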

API call fails with long outputs

When the evaluated model outputs long responses, the evaluation API call fails and keeps retrying. Consider counting tokens with tiktoken and truncating the trailing X tokens of the evaluated model's output to reduce the total length to 8192 tokens.

WARNING:root:Unknown error This model's maximum context length is 8192 tokens. However, your messages resulted in 8344 tokens. Please reduce the length of the messages..
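A minimal sketch of the suggested truncation with tiktoken; the 8192 limit comes from the error above, and the function name and budget handling are illustrative:

import tiktoken


def truncate_output(output: str, budget: int, model: str = "gpt-4") -> str:
    # Drop trailing tokens so the output fits in `budget` tokens; the budget
    # would be 8192 minus whatever the prompt template already consumes.
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(output)
    return output if len(tokens) <= budget else enc.decode(tokens[:budget])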

code review

  • code review. Doesn't have to be extremely thorough, but let's make sure there are no big issues or things that could be greatly simplified.
  • documentation review. Test the most important commands from the documentation, and update the documentation where it is unclear.

TypeError when trying to run alpaca_eval

Using Python 3.9. I get the following error when trying to run alpaca_eval with any model:

alpaca_eval evaluate_from_model --model_configs 'chatgpt' --annotators_config 'alpaca_eval_gpt4'

Traceback (most recent call last):
  File "/home/makeshn/ssd1/miniconda3/envs/alpaca_eval/bin/alpaca_eval", line 8, in <module>
    sys.exit(main())
  File "/home/makeshn/ssd1/miniconda3/envs/alpaca_eval/lib/python3.9/site-packages/alpaca_eval/main.py", line 468, in main
    fire.Fire(ALL_FUNCTIONS)
  File "/home/makeshn/ssd1/miniconda3/envs/alpaca_eval/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/makeshn/ssd1/miniconda3/envs/alpaca_eval/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/makeshn/ssd1/miniconda3/envs/alpaca_eval/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/makeshn/ssd1/miniconda3/envs/alpaca_eval/lib/python3.9/site-packages/alpaca_eval/main.py", line 214, in evaluate_from_model
    evaluation_dataset = utils.load_or_convert_to_dataframe(evaluation_dataset)
  File "/home/makeshn/ssd1/miniconda3/envs/alpaca_eval/lib/python3.9/site-packages/alpaca_eval/utils.py", line 264, in load_or_convert_to_dataframe
    if isinstance(df, AnyPath):
  File "/home/makeshn/ssd1/miniconda3/envs/alpaca_eval/lib/python3.9/typing.py", line 720, in __instancecheck__
    return self.__subclasscheck__(type(obj))
  File "/home/makeshn/ssd1/miniconda3/envs/alpaca_eval/lib/python3.9/typing.py", line 723, in __subclasscheck__
    raise TypeError("Subscripted generics cannot be used with"
TypeError: Subscripted generics cannot be used with class and instance checks

Dataset 'tatsu-lab/alpaca_eval' doesn't exist on the Hub

Hello,

I'm trying to evaluate my model using alpaca_eval, but I'm getting an error when loading the evaluation set: Dataset 'tatsu-lab/alpaca_eval' doesn't exist on the Hub. I checked the tatsu-lab Hugging Face account and the dataset is indeed not available. Can you please advise?

Best regards,
Hani

Use chatGPT as baseline?

The numbers are getting close to a 100% win rate; we should consider recalibrating win rates by comparing to ChatGPT.

add leaderboard of base models

Not needed for Monday.

  • simple script to perform SFT
  • create leaderboard of models

The benefit is that new base models will be evaluated directly on our leaderboard => no extra work for us to do.

Why does evaluate_from_model run so slowly on my side?

I am running with 8 A40 GPUs and I think it should be fast. I set up the environment and ran alpaca_eval evaluate_from_model --model_configs 'robin-v2-7b' --annotators_config 'claude' and alpaca_eval evaluate_from_model --model_configs 'robin-v2-7b' --annotators_config 'alpaca_eval_gpt4', but it takes a few days.
Also, it is surprising that I didn't provide any API key but it still runs. Why is that? Thank you so much for your help!

add configs for all models we tested

  • add configs of models we tested
  • add prompts of models we tested
  • add comments or mini readme in the config

These will be all the verified models on the leaderboard.

Question about 805 eval examples

Hi, thanks for your excellent work, especially the great effort put into designing and evaluating the evaluators.

Compared to these well-designed details, the 805 eval instructions do not seem to have much explanation. The AlpacaFarm paper only provides the root verb distribution and the source of the instructions. I would like to ask whether the topics of these instructions were carefully selected, e.g., whether they cover mathematics, coding, reasoning, etc., and could you share some principles for building the eval instruction set?

Thanks a lot!

Make PyPI package and test it

It would be good to test a couple of commands from the documentation; that way you can update the documentation if something is unclear.

chatgpt_fn returned json parsing error

It seems that OpenAI updated the results returned for ChatGPT queries when the server is overloaded (or the query limit is exceeded)? I am now getting a JSON parsing error after completing partial annotations with chatgpt_fn.

Error traceback attached below.

INFO:root:Creating the annotator from chatgpt_fn.
INFO:root:Saving annotations to /home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/alpaca_eval/evaluators_configs/chatgpt_fn/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/alpaca_eval/evaluators_configs/chatgpt_fn/annotations_seed0_configs.json.
WARNING:root:The length of outputs before and after merge are not the same. We have len(outputs_1)==
805, len(outputs_2)==657, and len(df_annotated)==657.
This means that there are missing examples or duplicates. We are taking a SQL inner join.

INFO:root:Annotating 640 examples with chatgpt_fn
INFO:root:Using openai_completions on 640 prompts using gpt-3.5-turbo-16k-0613.
INFO:root:Kwargs to completion: {'max_tokens': 50, 'temperature': 0, 'function_call': {'name': 'print_best_model'}, 'functions': [{'name': 'print_best_model', 'description': 'Print the best model given the preferred output.', 'parameters': {'type': 'object', 'properties': {'best_output': {'type': 'string', 'description': "Name of the best output, should be 'Output (a)' or 'Output (b)'"}}}, 'required': ['best_output']}]}
INFO:root:Kwargs to completion: {'n': 1, 'model': 'gpt-3.5-turbo-16k-0613', 'is_chat': True, 'max_tokens': 50, 'temperature': 0, 'function_call': {'name': 'print_best_model'}, 'functions': [{'name': 'print_best_model', 'description': 'Print the best model given the preferred output.', 'parameters': {'type': 'object', 'properties': {'best_output': {'type': 'string', 'description': "Name of the best output, should be 'Output (a)' or 'Output (b)'"}}}, 'required': ['best_output']}]}
prompt_batches: 15%|████████████████████▌ | 99/640 [00:12<01:08, 7.90it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/alpaca_eval/decoders/openai.py", line 205, in _openai_completion_helper
all_args = json.loads(choice.message.function_call.arguments)

File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/bin/alpaca_eval", line 8, in
sys.exit(main())
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/alpaca_eval/main.py", line 483, in main
fire.Fire(evaluate)
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/alpaca_eval/main.py", line 126, in evaluate
annotations = annotator.annotate_head2head(
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 316, in annotate_head2head
out = self.annotate_pairs(df_to_annotate, **decoding_kwargs)
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 346, in annotate_pairs
df_annotated = self._annotate(df_to_annotate, **decoding_kwargs)
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 437, in _annotate
curr_annotated = self.annotators[annotator](df_annotated.loc[curr_idcs, self.all_keys], **decoding_kwargs)
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 676, in call
completions = self.fn_completions(prompts=prompts, **self.completions_kwargs, **decoding_kwargs)
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/alpaca_eval/decoders/openai.py", line 140, in openai_completions
completions = list(
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/site-packages/tqdm/std.py", line 1178, in iter
for obj in iterable:
File "/home/liu/.conda/envs/hao_alpaca_eval_py310/lib/python3.10/multiprocessing/pool.py", line 873, in next
raise value
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
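One way to guard the json.loads call in the traceback would be a defensive parse that falls back instead of crashing the whole run; this is just a sketch, not the package's actual code:

import json
import logging


def parse_function_call_arguments(raw: str) -> dict:
    # When the server is overloaded, `function_call.arguments` can come back
    # empty or truncated; returning an empty dict lets the caller mark the
    # example as un-annotated instead of aborting all remaining annotations.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        logging.warning("Could not parse function_call arguments: %r", raw)
        return {}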

Pandas merge error

Trying to test the eval script:

alpaca_eval --model_outputs 'example/outputs.json'

I tried with Python 3.10 and Python 3.11, and got the same error both times:

  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/bin/alpaca_eval", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/alpaca_eval/main.py", line 483
, in main
    fire.Fire(evaluate)
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/fire/core.py", line 141, in Fi
re
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/fire/core.py", line 475, in _F
ire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/fire/core.py", line 691, in _C
allAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/alpaca_eval/main.py", line 126
, in evaluate
    annotations = annotator.annotate_head2head(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/alpaca_eval/annotators/pairwis
e_evaluator.py", line 316, in annotate_head2head
    out = self.annotate_pairs(df_to_annotate, **decoding_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 344, in annotate_pairs
    df_to_annotate = self._preprocess(to_annotate)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 387, in _preprocess
    df_to_annotate = self._merge_annotations(df_to_annotate, self.df_annotations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 533, in _merge_annotations
    df_to_annotate = df_to_annotate.merge(
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/pandas/core/frame.py", line 9843, in merge
    return merge(
           ^^^^^^
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/pandas/core/reshape/merge.py", line 148, in merge
    op = _MergeOperation(
         ^^^^^^^^^^^^^^^^
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/pandas/core/reshape/merge.py", line 741, in __init__
    self._maybe_coerce_merge_keys()
  File "/export/home/cxia/congyingxia-scratchpad/alpaca_eval/envs/lib/python3.11/site-packages/pandas/core/reshape/merge.py", line 1401, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
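For what it's worth, the error can be reproduced and worked around by casting the merge keys to a common dtype; the column name below is hypothetical, since the real key comes from the cached annotations file:

import pandas as pd

# Cached annotations store the key as int, the fresh dataframe as string (object).
cached = pd.DataFrame({"key": [0, 1], "preference": [1, 2]})
fresh = pd.DataFrame({"key": ["0", "1"], "instruction": ["a", "b"]})

# fresh.merge(cached, on="key")  # raises: merging on object and int64 columns

# Casting both sides to the same dtype before merging avoids the ValueError.
fresh["key"] = fresh["key"].astype(str)
cached["key"] = cached["key"].astype(str)
merged = fresh.merge(cached, on="key")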

Add documentation / process for contributing

  • discord
  • README (outputs / number)

Ideally, we would include the outputs, but I don't want the package to become too large, so the package should not contain them. One possibility is to make them optional but say that outputs need to be on the Hugging Face Hub. Then in the leaderboard we can say whether the data is there, which would make the results more believable.

add analysis of eval set

Use a pairwise t-test on the rankings of all leaderboards from the dataset (a sketch of the pairwise test follows the list below).

  • heatmap plotting
  • compute the number of samples to get statistical significance at a given rate
  • return mean and max p value
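A minimal sketch of the pairwise test, assuming each model's per-example preference scores are already available as arrays (the input format is hypothetical); the resulting p-values could feed the heatmap, and the mean and max of the values give the summary statistics above:

import itertools

import numpy as np
from scipy import stats


def pairwise_pvalues(per_example_scores: dict[str, np.ndarray]) -> dict[tuple[str, str], float]:
    # Paired t-test between every pair of models on their per-instruction scores.
    out = {}
    for a, b in itertools.combinations(per_example_scores, 2):
        _, p_value = stats.ttest_rel(per_example_scores[a], per_example_scores[b])
        out[(a, b)] = p_value
    return out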

Falcon support for generation

Running

alpaca_eval evaluate_from_model --model_configs 'falcon-7b-instruct'

Gives the following warning

The model 'RWForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].

It then proceeds with generation. We need to investigate whether this is a bug. We should probably rewrite the inference code to avoid the HF generation pipeline and roll our own loop, as sketched below.
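A minimal sketch of what a hand-rolled loop could look like; the model name, dtype, and decoding settings are illustrative, and trust_remote_code=True is what lets the custom RWForCausalLM class load outside the pipeline registry:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

prompts = ["What is the capital of France?"]  # placeholder for the eval set prompts
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    completion = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(completion)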

LLaMA 13B with evaluate_from_model does not use the GPU

The code is running on 1 × A100; I am using nvidia-smi to check and find that the GPU is not used. Running alpaca_eval evaluate_from_model --model_outputs $PWD/qa.json --annotators_config 'alpaca_eval_gpt4' --model_configs $PWD/llama for a few minutes then reports an error. May I ask what the cause is? Where is my setup wrong?

  File "/root/miniconda3/envs/eval/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
    return self.unk_token_id
  File "/root/miniconda3/envs/eval/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1155, in unk_token_id
    return self.convert_tokens_to_ids(self.unk_token)
  File "/root/miniconda3/envs/eval/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
    return self._convert_token_to_id_with_added_voc(tokens)
  [the three frames above repeat until the recursion limit is reached]
  File "/root/miniconda3/envs/eval/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1155, in unk_token_id
    return self.convert_tokens_to_ids(self.unk_token)
  File "/root/miniconda3/envs/eval/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1035, in unk_token
    return str(self._unk_token)
RecursionError: maximum recursion depth exceeded while calling a Python object

Run command

alpaca_eval evaluate_from_model --model_outputs $PWD/qa.json --annotators_config 'alpaca_eval_gpt4' --model_configs $PWD/llama

model config

# configs.yaml
knowlm-13b:
  prompt_template: "/root/model-eval/llama/prompt.txt"
  fn_completions: "huggingface_local_completions"
  completions_kwargs:
    model_name: "/root/.cache/LLAMA/" # LLAMA 13B
    model_kwargs:
      torch_dtype: 'float32'
    max_new_tokens: 2000
    temperature: 0.7
    top_p: 1.0
    do_sample: True
  pretty_name: "LLAMA 13B"
  link: "https://example.com/"

env

> python --version
Python 3.10.11

>>> import torch
>>> print(torch.cuda.is_available())
True
>>> print(torch.__version__)
2.0.1+cu117

> tree /root/.cache/LLAMA/
|-- config.json
|-- generation_config.json
|-- model-00002-of-00006.safetensors
|-- pytorch_model-00001-of-00006.bin
|-- pytorch_model-00002-of-00006.bin
|-- pytorch_model-00003-of-00006.bin
|-- pytorch_model-00004-of-00006.bin
|-- pytorch_model-00005-of-00006.bin
|-- pytorch_model-00006-of-00006.bin
|-- pytorch_model.bin.index.json
|-- special_tokens_map.json
|-- tokenizer.model
|-- tokenizer_config.json

Separate files per provider?

hey @YannDubs - we should have palm-2-chat-bison in by the end of this week, will add it then.

Y'all have an interesting approach of an individual file per provider

Why do it this way, vs. making the completion call inside the __init__.py?

I also noticed you're calculating cost per provider <- why is that?

Strange prompt(s)

  1. Investigating some results, I came across this prompt, which is very strange.
    It is formatted as a conversation, with no particular instruction about it.
  {
    "instruction":"User : Hi dear \nAgent : helo , can ai help you \nUser : plaes tell me about what frequency conscious of ai\nAgent : the conscious ai is a complex actually need specific information to provide about conscious.\nUser : pleas tell me more about conscious if you know please let me know\nAgent : conscious is about made some result or decision is to hard if not have knowledge or prove with science.\nUser : Please tell more about that.\nAgent : need a data scientist to provide decision because a different conscious human and artificial intelligence.",
    "output":"The conscious AI requires data scientists to make decisions because the conscious of humans and artificial intelligence are different. This requires extensive knowledge and proof from science in order to make the correct decisions or achieve the desired results.",
    "generator":"text_davinci_003",
    "dataset":"oasst"
  },
  2. Can you elaborate on the prompt "templates" which use "Instruction: ... Output: ..." on models that are already instruction-finetuned? What is the best way to just have them use the instructions directly?

[DOC] Update Anthropic docstring

Anthropic changed their Python SDK, making this code line outdated:

Additional kwargs to pass to `anthropic.Client.completion`.
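For reference, a minimal sketch of the call under the rewritten SDK (assuming anthropic>=0.3, where Client.completion was replaced by Anthropic().completions.create); the model name and prompt are illustrative:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.completions.create(
    model="claude-2",
    max_tokens_to_sample=300,
    prompt=f"{anthropic.HUMAN_PROMPT} Hello, how are you?{anthropic.AI_PROMPT}",
)
print(response.completion)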


Would love to know if this might help - https://github.com/BerriAI/litellm

Roughly: a simple I/O library that standardizes all the LLM API calls to the OpenAI call format.

import os

from litellm import completion

## set ENV variables
# ENV variables can be set in a .env file, too. Example in .env.example
os.environ["OPENAI_API_KEY"] = "openai key"
os.environ["ANTHROPIC_API_KEY"] = "anthropic key"

messages = [{"content": "Hello, how are you?", "role": "user"}]

# openai call
response = completion(model="gpt-3.5-turbo", messages=messages)

# anthropic call
response = completion("claude-v-2", messages)
