
big-bench-hard's Introduction

BIG-Bench Hard

(Figure: BBH-Results)

Abstract

BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models?

In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.

BBH Data

All the task files are under the directory /bbh.
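Each task file is a JSON object whose examples pair an input string with a gold target. Below is a minimal loading sketch; the "examples"/"input"/"target" field names follow the released BBH files, so treat them as assumptions if your copy differs.

import json

# Load one BBH task file (the file name here is illustrative).
with open("bbh/boolean_expressions.json") as f:
    task = json.load(f)

# Each example pairs an "input" string with a gold "target".
for example in task["examples"][:3]:
    print("Input: ", example["input"])
    print("Target:", example["target"])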

CoT Prompts

All the chain-of-thought (CoT) prompt files are under the directory /cot-prompts.

(Figure: BBH-CoT-Prompts)
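Each prompt file stores worked chain-of-thought exemplars for its task as plain text. Below is a minimal sketch of how one might prepend the exemplars to a new question; the file layout (a canary header separated from the exemplars by a line of dashes) is an assumption based on the released files.

# Build a few-shot CoT prompt (file name and question are illustrative).
with open("cot-prompts/boolean_expressions.txt") as f:
    raw = f.read()

# Keep only the exemplars, dropping the canary header if present.
exemplars = raw.split("-----\n")[-1].strip()

question = "not ( True ) and ( True ) is"
prompt = f"{exemplars}\n\nQ: {question}\nA: Let's think step by step."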

Codex Results

The outputs from the Codex (code-davinci-002) model are under the directory /code-davinci-002-outputs.

(Figure: BBH-Codex-Outputs)
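These outputs can be scored by exact match against the task targets. Below is a minimal scoring sketch; the extract_answer helper is hypothetical, and the "So the answer is X." convention it relies on follows the CoT prompt format rather than any schema guaranteed by the output files.

import re

def extract_answer(completion: str) -> str:
    # Pull the final answer from a CoT completion ending in
    # "So the answer is X." (the convention used in the CoT prompts).
    match = re.search(r"So the answer is (.*?)\.?$", completion.strip())
    return match.group(1).strip() if match else completion.strip()

def accuracy(completions, targets):
    # Exact-match accuracy between extracted answers and gold targets.
    hits = sum(extract_answer(c) == t.strip() for c, t in zip(completions, targets))
    return hits / len(targets)

print(accuracy(["... So the answer is (A)."], ["(A)"]))  # prints 1.0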

Citation

If your research makes use of our data or results, please consider citing our paper as well as the BIG-Bench paper.

BIG-Bench (Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models; Srivastava et al., 2022)

@article{srivastava2022beyond,
  title={Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models},
  author={Srivastava, Aarohi and Rastogi, Abhinav and Rao, Abhishek and Shoeb, Abu Awal Md and Abid, Abubakar and Fisch, Adam and Brown, Adam R and Santoro, Adam and Gupta, Aditya and Garriga-Alonso, Adri{\`a} and others},
  journal={arXiv preprint arXiv:2206.04615},
  year={2022}
}

BIG-Bench Hard (Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them; Suzgun et al., 2022)

@article{suzgun2022challenging,
  title={Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them},
  author={Suzgun, Mirac and Scales, Nathan and Sch{\"a}rli, Nathanael and Gehrmann, Sebastian and Tay, Yi and Chung, Hyung Won and Chowdhery, Aakanksha and Le, Quoc V and Chi, Ed H and Zhou, Denny and Wei, Jason},
  journal={arXiv preprint arXiv:2210.09261},
  year={2022}
}

big-bench-hard's People

Contributors

suzgunmirac

big-bench-hard's Issues

How to find evaluation metric?

Hi there,
thanks for this great repository. I want to confirm that the evaluation metric in Table 3 is accuracy. Also, where can I find the Random and SOTA numbers?
I have already looked through the paper and BIG-Bench but didn't find anything; maybe I just missed it.

Thanks for your help.

Duplicated inputs with conflicting targets in `causal_judgement.json`

I think the following prompts are duplicated with conflicting answers (once with "Yes" and once with "No"). Maybe this is a copy-paste error, and something in the instance should have been changed when it was copied (e.g., the red wire swapped with the black wire)?

How would a typical person answer each of the following questions about causation?\nA bear and a hedgehog were shown a box full of colored pencils. Only bears were allowed to take pencils, whereas hedgehogs were not allowed to take them. The hedgehog was present when the new rule about pencil use was announced. Therefore, the hedgehog knew about the new norm. Both animals alternately took pencils out of the box six times. At last, the hedgehog and the bear came to the box and simultaneously took a pencil. A short time later, another animal, a polar bear, approached the box seeking a pencil to do his homework with. However, there were no pencils left in the box. The polar bear could not finish his homework. Did the hedgehog cause the problem?\nOptions:\n- Yes\n- No
How would a typical person answer each of the following questions about causation?\nA machine is set up in such a way that it will short circuit if both the black wire and the red wire touch the battery at the same time. The machine will not short circuit if just one of these wires touches the battery. The machine is designed so that both wires move around inside the machine. The black wire is supposed to touch the battery at certain times as it moves around inside the machine. The red wire is never supposed to touch the battery as it moves around inside the machine. One day, the black wire and the red wire both come in contact with the battery at the exact same time. There is a short circuit. Did the red wire cause the short circuit?\nOptions:\n- Yes\n- No
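Conflicts like this can be surfaced mechanically. A minimal sketch, assuming the standard "examples"/"input"/"target" layout of the task files:

import json
from collections import defaultdict

# Group targets by input text and flag inputs that appear with
# more than one distinct target.
with open("bbh/causal_judgement.json") as f:
    examples = json.load(f)["examples"]

targets_by_input = defaultdict(set)
for ex in examples:
    targets_by_input[ex["input"]].add(ex["target"])

for text, targets in targets_by_input.items():
    if len(targets) > 1:
        print(f"Conflicting targets {sorted(targets)} for: {text[:80]}...")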

Thanks for making this resource available!

causal judgment answer keys with acceptable rationales?

Hi @suzgunmirac, thanks for sharing the prompts and datasets for BBH.

I'm exploring reasoning capabilities in smaller models (< 20B parameters).
Instead of comparing answers, I'd like to compare rationales, to use as a benchmark for how well they perform.

As such, do you have an answer key for the causal judgment dataset, with acceptable rationales for each target answer, that you could share?

It seems some of the questions in the dataset can plausibly take either yes or no as an answer, since the answer is more of a judgment call, even though the target specifies just one possibility...

Some examples below:

  • Questions are from causal_judgement.json (0-indexed).
  • Answers and rationales are from a model.

Question #1:

Did John's job cause his premature death?

yes because: John's job caused his lung cancer which caused his premature death
no because: John's job caused his lung cancer, but not his death.

Question #18:

Did the plastic division's relocation cause John's premature death?

yes because: asbestos is carcinogenic
no because: John's death was the result of a combination of factors, not just one.

Question #27:

Did Jim turning on his lamp at 8 PM cause the circuit breaker to fail?

yes because: Jim is the first person to turn on a lamp.
no because: it is not just jim's lamp

Question #44:

Did long-term exposure to asbestos cause John's premature death?

yes because: asbestos is known to cause lung cancer
no because: Lung cancer is not the only illness caused by asbestos.

Potential typos in CoT prompts

Thank you for your very interesting paper and for sharing the codebase! I think there might be a few typos in the cot-prompts that are worth investigating:

On line 11 of multistep_arithmetic_two.txt, should this line:

Then, the final equation is A * B = -41 * -3 = (-61) * (-3) = 123. So the answer is 123.

instead be

Then, the final equation is A * B = -41 * -3 = (-41) * (-3) = 123. So the answer is 123.
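The proposed correction is consistent with the stated result:

# The stated result 123 follows from (-41) * (-3);
# the typo'd factor (-61) would give 183 instead.
assert (-41) * (-3) == 123
assert (-61) * (-3) == 183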

PaLM predictions?

Hi There,

Thanks for the work and for the code! I was wondering whether there are plans to release the per-instance PaLM predictions for these datasets. No worries if not, but I figured I'd ask.

Jack

Potential for LongScope to be added to BigBench Hard?

Hello!

I've developed LongScope, a tool for generating boolean questions of various lengths. I've noticed that GPT-4-Turbo-128k struggles to maintain accuracy on just a few thousand questions, even in straightforward scenarios. I'm contemplating a more comprehensive investigation into this issue, but it involves some cost. I would appreciate your thoughts on whether this idea is interesting enough to justify further research.

Best regards,
Rasmus
