
LongBench's Introduction

🤗 HF Repo • 📃 Paper

Read this in Chinese.

📖 LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

LongBench is the first benchmark for the bilingual, multitask, and comprehensive assessment of the long context understanding capabilities of large language models. LongBench covers both Chinese and English, providing a more comprehensive evaluation of large models' multilingual capabilities on long contexts. In addition, LongBench is composed of six major categories and twenty-one different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.

We are fully aware that model evaluation can be costly, especially in long context scenarios (e.g., manual annotation or API call costs). Therefore, we adopt a fully automated evaluation method, aimed at measuring and evaluating models' long context understanding at the lowest possible cost.

LongBench includes 14 English tasks, 5 Chinese tasks, and 2 code tasks, with the average length of most tasks ranging from 5k to 15k, and 4,750 test instances in total. For detailed statistics and construction methods of the LongBench tasks, please refer here. In addition, we provide LongBench-E, a test set with a more uniform length distribution constructed by uniform sampling, with comparable amounts of data in the 0-4k, 4k-8k, and 8k+ length intervals, to analyze how model performance varies with input length.

| Task Type | #English Tasks | #Chinese Tasks | #Code Tasks |
| --- | --- | --- | --- |
| Multi-document QA | 3 | 1 | - |
| Single-document QA | 3 | 1 | - |
| Summarization | 3 | 1 | - |
| Few-shot learning | 3 | 1 | - |
| Synthetic Tasks | 2 | 1 | - |
| Code Completion | - | - | 2 |

🔥 Updates

[2024/02/01] Check out our new effort in Long context LLMs: LongAlign. We explore the best recipe for long context alignment. We also propose LongBench-Chat, the first real-world long context evaluation benchmark (10k-100k input length). We also release an instruction-following dataset at HF Dataset, along with a suite of competitive long context LLMs trained with LongAlign!

[2023/10/30] The new ChatGLM3-6B-32k chat model is out, with stronger proficiency in long context modeling; it is especially good at long document based question answering, reasoning, and summarization. Check out its performance on LongBench.

[2023/08/29] The LongBench paper is released, along with several important updates to LongBench:

  1. More comprehensive datasets: The MultiNews dataset for multi-document summarization is added to the summarization tasks, and the summarization task SAMSum is added to the Few-shot learning tasks, replacing the previous QA task NQ. TriviaQA and RepoBench-P are resampled to ensure a more appropriate data length;
  2. More uniform length distribution: LongBench-E is obtained by uniform sampling according to length, featuring a comparable amount of test data in the length intervals of 0-4k, 4k-8k, and 8k+, making it more suitable for evaluating the model's ability at different input lengths;
  3. All evaluation code made public: The code for evaluating all models has been made public, and the code for retrieval-based and summarization-based long context compression strategies is also provided.

🖥️ Leaderboard

Here are the average scores (%) on the main task categories in both English and Chinese under the zero-shot setting. Please refer to this link for the evaluation metrics used for each task.

Note: For text exceeding the processing length capability of the model, we truncate from the middle of the text, preserving information from the beginning and end, in accordance with the observations from Lost in the Middle. Experiments show that this truncation method has the least impact on model performance.
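
For illustration, middle truncation can be sketched as follows (a minimal sketch assuming a Hugging Face tokenizer and a token budget max_length; see pred.py for the actual implementation):

```python
# Minimal sketch of middle truncation: keep the beginning and end of the prompt,
# dropping tokens from the middle when the prompt exceeds the token budget.
def truncate_middle(prompt: str, tokenizer, max_length: int) -> str:
    ids = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids[0]
    if len(ids) <= max_length:
        return prompt
    half = max_length // 2
    return (tokenizer.decode(ids[:half], skip_special_tokens=True)
            + tokenizer.decode(ids[-half:], skip_special_tokens=True))
```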

English

| Model | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 44.0 | 39.8 | 38.7 | 26.5 | 67.1 | 54.1 | 37.8 |
| Llama2-7B-chat-4k | 31.0 | 24.9 | 22.6 | 24.7 | 60.0 | 48.1 | 5.9 |
| LongChat-v1.5-7B-32k | 34.3 | 28.7 | 20.6 | 26.7 | 60.0 | 54.1 | 15.8 |
| XGen-7B-8k | 28.3 | 24.6 | 20.4 | 24.7 | 56.2 | 38.6 | 5.3 |
| InternLM-7B-8k | 24.2 | 17.4 | 20.2 | 16.1 | 50.3 | 36.4 | 4.5 |
| ChatGLM2-6B-32k | 40.9 | 32.9 | 33.7 | 27.6 | 59.1 | 52.7 | 39.2 |
| Vicuna-v1.5-7B-16k | 31.9 | 28.0 | 18.6 | 26.0 | 66.2 | 47.3 | 5.5 |
| ChatGLM3-6B-32k | 48.5 | 40.3 | 46.6 | 29.5 | 68.1 | 56.2 | 50.5 |

Chinese

| Model | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.1 | 77.5 |
| Llama2-7B-chat-4k | 14.3 | 11.9 | 5.2 | 0.2 | 19.8 | 48.1 | 0.5 |
| LongChat-v1.5-7B-32k | 23.9 | 29.1 | 19.5 | 9.9 | 23.2 | 54.1 | 7.6 |
| XGen-7B-8k | 15.1 | 14.8 | 11.0 | 2.2 | 20.5 | 38.6 | 3.5 |
| InternLM-7B-8k | 18.3 | 33.6 | 11.1 | 12.4 | 15.2 | 36.4 | 0.9 |
| ChatGLM2-6B-32k | 41.7 | 51.6 | 37.6 | 16.2 | 27.7 | 52.7 | 64.5 |
| Vicuna-v1.5-7B-16k | 26.4 | 43.0 | 19.3 | 15.1 | 28.8 | 47.3 | 5.0 |
| ChatGLM3-6B-32k | 52.8 | 62.3 | 44.8 | 17.8 | 42.0 | 56.2 | 94.0 |

Radar Chart on Long Context Capability

Variation of Abilities under Different Context Lengths

To specifically analyze model performance under different context lengths, the following chart shows the models' scores, averaged over all tasks within each task category, across the different context length intervals of LongBench-E.

⚙️ How to evaluate on LongBench

Load Data

You can download and load the LongBench data through the Hugging Face datasets (🤗 HF Repo):

from datasets import load_dataset

datasets = ["narrativeqa", "qasper", "multifieldqa_en", "multifieldqa_zh", "hotpotqa", "2wikimqa", "musique", \
            "dureader", "gov_report", "qmsum", "multi_news", "vcsum", "trec", "triviaqa", "samsum", "lsht", \
            "passage_count", "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]

for dataset in datasets:
    data = load_dataset('THUDM/LongBench', dataset, split='test')

Similarly, you can load the LongBench-E data:

from datasets import load_dataset

datasets = ["qasper", "multifieldqa_en", "hotpotqa", "2wikimqa", "gov_report", "multi_news", "trec", \
            "triviaqa", "samsum", "passage_count", "passage_retrieval_en", "lcc", "repobench-p"]

for dataset in datasets:
    data = load_dataset('THUDM/LongBench', f"{dataset}_e", split='test')

Alternatively, you can download the folder from this link to load the data.

Data Format

All data in LongBench (LongBench-E) are standardized to the following format:

{
    "input": "The input/command for the task, usually short, such as questions in QA, queries in Few-shot tasks, etc",
    "context": "The long context required for the task, such as documents, cross-file code, few-shot examples in Few-shot tasks",
    "answers": "A List of all true answers",
    "length": "Total length of the first three items (counted in characters for Chinese and words for English)",
    "dataset": "The name of the dataset to which this piece of data belongs",
    "language": "The language of this piece of data",
    "all_classes": "All categories in classification tasks, null for non-classification tasks",
    "_id": "Random id for each piece of data"
}
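
For illustration, a single sample and the fields described above can be inspected like this:

```python
from datasets import load_dataset

# Load one LongBench subset and inspect the documented fields.
data = load_dataset('THUDM/LongBench', 'hotpotqa', split='test')
sample = data[0]
print(sample['input'])        # the question / instruction
print(sample['length'])       # words for English data, characters for Chinese data
print(sample['answers'])      # list of reference answers
print(sample['all_classes'])  # None unless the task is classification
```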

Evaluation

Install the requirements with pip: pip install -r requirements.txt. For Llama-2 based models, we recommend using Flash Attention for optimization and to save GPU memory. The relevant dependencies can be installed according to the Flash Attention code base.

First, run pred.py and select the model you want to evaluate via --model. Let's take ChatGLM3-6B-32k as an example (the HuggingFace model weights will be downloaded automatically according to the path in model2path.json; you can change the path in this file to load the model weights from a local directory):

CUDA_VISIBLE_DEVICES=0 python pred.py --model chatglm3-6b-32k

You can also run inference on multiple GPUs in parallel (one model per GPU):

CUDA_VISIBLE_DEVICES=0,1,2,3 python pred.py --model chatglm3-6b-32k

You can obtain the model's output on all LongBench datasets under the pred/ folder corresponding to the model name. Similarly, with the --e flag:

CUDA_VISIBLE_DEVICES=0 python pred.py --model chatglm3-6b-32k --e

You can obtain the output on LongBench-E under the pred_e/ folder. After that, run the evaluation code in eval.py:

python eval.py --model chatglm3-6b-32k

You can get the evaluation results on all datasets in result.json. The model's average scores over the different length intervals of the LongBench-E datasets can be obtained with the --e flag.

Please note that in config/, we provide the input format suitable for each dataset and the maximum output length. Feel free to modify them to better suit the model you want to evaluate. After modification, when evaluating with pred.py, the data will be automatically organized according to the new format to get the corresponding model output.
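
As an illustration of how these configs are consumed when building prompts (the file names below are assumptions for the sketch; check config/ and pred.py for the exact ones):

```python
import json
from datasets import load_dataset

dataset = "hotpotqa"
sample = load_dataset('THUDM/LongBench', dataset, split='test')[0]

# Assumed file names: per-dataset prompt template and maximum generation length.
dataset2prompt = json.load(open("config/dataset2prompt.json"))
dataset2maxlen = json.load(open("config/dataset2maxlen.json"))

prompt = dataset2prompt[dataset].format(context=sample["context"], input=sample["input"])
max_gen = dataset2maxlen[dataset]
```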

In addition, we provide the code for long context compression evaluation based on retrieval and summarization (see Section 4.2 of the LongBench paper for implementation details) in the retrieval/ and summ/ folders, respectively.
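
The general idea behind the retrieval-based variant can be sketched as follows (an illustration only, using an off-the-shelf embedding model; it is not the repository's implementation, which ships its own retrievers in retrieval/):

```python
from sentence_transformers import SentenceTransformer, util

def compress_context(context: str, query: str, chunk_size: int = 500, top_k: int = 7) -> str:
    """Keep only the chunks of the long context most similar to the query."""
    model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model
    words = context.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    scores = util.cos_sim(model.encode(query, convert_to_tensor=True),
                          model.encode(chunks, convert_to_tensor=True))[0]
    keep = scores.topk(min(top_k, len(chunks))).indices.sort().values  # preserve original order
    return "\n".join(chunks[int(i)] for i in keep)
```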

📊 Evaluation Result on Each Dataset

The following tables show the Zero-shot evaluation results (%) on all datasets, where Chinese datasets are denoted by "zh" (please refer to this link for the evaluation metrics used for each task).

Single-Document QA

| Model | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
| --- | --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 |
| Llama2-7B-chat-4k | 18.7 | 19.2 | 36.8 | 11.9 |
| LongChat-v1.5-7B-32k | 16.9 | 27.7 | 41.4 | 29.1 |
| XGen-7B-8k | 18.0 | 18.1 | 37.7 | 14.8 |
| InternLM-7B-8k | 12.1 | 16.7 | 23.4 | 33.6 |
| ChatGLM2-6B-32k | 21.1 | 31.5 | 46.2 | 51.6 |
| Vicuna-v1.5-7B-16k | 19.4 | 26.1 | 38.5 | 43.0 |
| ChatGLM3-6B-32k | 26.0 | 43.3 | 51.7 | 62.3 |

Multi-Document QA

| Model | HotpotQA | 2WikiMQA | Musique | DuReader (zh) |
| --- | --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 |
| Llama2-7B-chat-4k | 25.4 | 32.8 | 9.4 | 5.2 |
| LongChat-v1.5-7B-32k | 31.5 | 20.6 | 9.7 | 19.5 |
| XGen-7B-8k | 29.7 | 21.1 | 10.3 | 11.0 |
| InternLM-7B-8k | 28.7 | 22.8 | 9.0 | 11.1 |
| ChatGLM2-6B-32k | 45.1 | 34.0 | 21.9 | 37.6 |
| Vicuna-v1.5-7B-16k | 25.3 | 20.8 | 9.8 | 19.3 |
| ChatGLM3-6B-32k | 54.4 | 44.9 | 40.4 | 44.78 |

Summarization

| Model | GovReport | QMSum | MultiNews | VCSUM (zh) |
| --- | --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 26.7 | 16.0 |
| Llama2-7B-chat-4k | 27.3 | 20.8 | 25.8 | 0.2 |
| LongChat-v1.5-7B-32k | 30.8 | 22.7 | 26.4 | 9.9 |
| XGen-7B-8k | 27.3 | 20.5 | 26.2 | 2.2 |
| InternLM-7B-8k | 9.7 | 15.9 | 22.8 | 12.4 |
| ChatGLM2-6B-32k | 32.4 | 24.0 | 26.5 | 16.2 |
| Vicuna-v1.5-7B-16k | 27.9 | 22.8 | 27.2 | 15.1 |
| ChatGLM3-6B-32k | 36.8 | 23.9 | 27.9 | 17.8 |

Few-shot Learning

| Model | TREC | TriviaQA | SAMSum | LSHT (zh) |
| --- | --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 68.0 | 91.4 | 41.7 | 29.2 |
| Llama2-7B-chat-4k | 61.5 | 77.8 | 40.7 | 19.8 |
| LongChat-v1.5-7B-32k | 63.5 | 82.3 | 34.2 | 23.2 |
| XGen-7B-8k | 65.5 | 77.8 | 25.3 | 20.5 |
| InternLM-7B-8k | 52.0 | 77.8 | 21.2 | 15.2 |
| ChatGLM2-6B-32k | 62.5 | 78.7 | 36.3 | 27.7 |
| Vicuna-v1.5-7B-16k | 71.5 | 86.2 | 40.8 | 28.8 |
| ChatGLM3-6B-32k | 79.0 | 87.1 | 38.2 | 42.0 |

Synthetic Tasks

| Model | Passage Count | PassageRetrieval-en | PassageRetrieval-zh |
| --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 4.5 | 71.0 | 77.5 |
| Llama2-7B-chat-4k | 2.1 | 9.8 | 0.5 |
| LongChat-v1.5-7B-32k | 1.0 | 30.5 | 7.6 |
| XGen-7B-8k | 2.1 | 8.5 | 3.5 |
| InternLM-7B-8k | 3.0 | 6.0 | 0.9 |
| ChatGLM2-6B-32k | 1.5 | 77.0 | 64.5 |
| Vicuna-v1.5-7B-16k | 6.5 | 4.5 | 5.0 |
| ChatGLM3-6B-32k | 2.0 | 99.0 | 94.0 |

Code Completion

| Model | LCC | RepoBench-P |
| --- | --- | --- |
| GPT-3.5-Turbo-16k | 54.7 | 53.6 |
| Llama2-7B-chat-4k | 52.4 | 43.8 |
| LongChat-v1.5-7B-32k | 53.0 | 55.3 |
| XGen-7B-8k | 38.6 | 38.6 |
| InternLM-7B-8k | 44.1 | 28.8 |
| ChatGLM2-6B-32k | 55.6 | 49.9 |
| Vicuna-v1.5-7B-16k | 51.0 | 43.5 |
| ChatGLM3-6B-32k | 57.66 | 54.76 |

📄 Acknowledgement

📝 Citation

@article{bai2023longbench,
  title={LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding},
  author={Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi},
  journal={arXiv preprint arXiv:2308.14508},
  year={2023}
}

When citing our work, please kindly consider citing the original dataset papers. The relevant citation information is listed here.

LongBench's People

Contributors

bys0318, davidlvxin, faustlyu, jackkuo666, mcjacktang


LongBench's Issues

Kimi-Chat evaluation

Kimi-Chat is said to have very strong long-text capability; it would be great if the repository owner could add an evaluation of it.

Could you release the plotting code?

1. Great work! Could you release the code used to draw the figures in the README?
2. Are the evaluation results computed on all of the data or only a subset? (pred.py takes 10 samples per task.)

CUDA out of memory error. Device: NVIDIA A100

Error log:
Traceback (most recent call last):
File "pred.py", line 68, in
preds = get_pred(model, tokenizer, data, max_length, max_gen, prompt_format, dataset, device)
File "pred.py", line 27, in get_pred
output = model.generate(
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 1522, in generate
return self.greedy_search(
File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 2339, in greedy_search
outputs = self(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 932, in forward
transformer_outputs = self.transformer(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 828, in forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 638, in forward
layer_ret = layer(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 542, in forward
attention_output, kv_cache = self.self_attention(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 439, in forward
context_layer = self.core_attention(query_layer, key_layer, value_layer, attention_mask)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 278, in forward
attention_scores = attention_scores.masked_fill(attention_mask, float("-inf"))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.79 GiB (GPU 0; 79.20 GiB total capacity; 49.97 GiB already allocated; 9.89 GiB free; 68.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The only thing I changed is the data-loading part of pred.py: I downloaded the data from Hugging Face and unzipped it.
Device: NVIDIA A100, 80 GB memory.
Is this a problem with the device? The data-loading code and the data path are shown below.

LongBench-E & LongBench

What is the difference between LongBench-E and LongBench, and is there any overlapping data between them?

Can the evaluation be used for base models?

Looking at the reported results, all the evaluated models are chat models, and the prompts are organized in a zero-shot way. Does that mean the current setup targets SFT models and is not suitable for base models?

Evaluate on long context (32k,64k etc..) on 30B/70B large models

Hi,

I found that the original script cannot handle large models on long contexts effectively, since it uses multiprocessing to load an entire model onto a single GPU.

I also tried different methods to add support for 30B/70B models, such as deepspeed-inference, accelerate, and vllm. In the end, vllm can support benchmarking large models with long contexts (34B with 32k context on an 8*A800 node in my case), and it requires minimal changes to the original code.

I hope this information can help people who also want to evaluate on large models
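
For reference, the vLLM route described above might look roughly like this (the checkpoint path, parallel size, and sampling settings are illustrative, not the commenter's exact setup):

```python
from vllm import LLM, SamplingParams

# Illustrative: shard a large long-context model across one 8-GPU node with tensor parallelism.
llm = LLM(model="/path/to/34b-32k-checkpoint", tensor_parallel_size=8, max_model_len=32768)
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["<prompt built from a LongBench sample>"], params)
print(outputs[0].outputs[0].text)
```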

Cannot run inference on a single A100

Hello, after running

CUDA_VISIBLE_DEVICES=0 python pred.py --model chatglm3-6b-32k

the GPU memory blew up. What is the minimum GPU memory requirement for the chatglm3-6b-32k model? A single A100 cannot run the inference.
The error message is as follows:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 17.95 GiB (GPU 0; 79.35 GiB total capacity; 48.28 GiB already allocated; 11.78 GiB free; 66.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

CUDA error??????

When I tried to use chatglm3-6b to test on LongBench, I got the following error after loading the model:

"variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions."

Could someone help me?

AttributeError: 'str' object has no attribute 'to'

Traceback (most recent call last):
  File "/home/./miniconda3/envs/longbench/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/./miniconda3/envs/longbench/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/./GitHub/LongBench-main/pred.py", line 66, in get_pred
    input = prompt.to(device)
            ^^^^^^^^^
AttributeError: 'str' object has no attribute 'to

Correct prompt for Llama2-chat

Hi! First of all, I am very glad to see this great work on a bilingual dataset! However, based on my experiments on LEval, Llama2-chat has a very strong ability on long document tasks, while the results on LongBench seem divergent. I think a potential reason is that we use different prompt formats.
Here is my prompt: B_INST + B_SYS + system_prompt + E_SYS + context + instruction + E_INST.
The context is truncated from the right~
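
For clarity, the prompt format described above corresponds roughly to the following sketch (special strings as in Meta's reference code; the exact spacing is an assumption):

```python
# Llama-2 chat prompt as described by the commenter.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_llama2_prompt(system_prompt: str, context: str, instruction: str) -> str:
    return f"{B_INST} {B_SYS}{system_prompt}{E_SYS}{context}\n\n{instruction} {E_INST}"
```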

OOM

I want to compare direct extrapolation, interpolation, and NTK scaling, so I removed the truncation logic in pred.py. At seq_len=21000, my A800 reports an OOM error. Is this expected, and how should I solve it? I am not using xformers.

No repeat_kv in llama_flash_attn_monkey_patch.py ?

In Llama 2's code, repeat_kv is called before attention to repeat the k/v heads (GQA).
But I didn't find any repeat_kv in llama_flash_attn_monkey_patch.py.
I think this may be a mistake 🤔 If Llama 2 is evaluated without repeat_kv, I think the results would also be incorrect.
Is my understanding correct?

A single A100 40G cannot run (OOM) llama2-7b-chat-4k, but can run chatglm2-6b-32k

Below is the output of running python pred.py --model llama2-7b-chat-4k --e on four A100 40G GPUs (the same happens with CUDA_VISIBLE_DEVICES=0). The max length is confirmed to be 3500, yet it tries to allocate 140 GB of GPU memory? According to the output, Llama also has Flash Attention enabled. In the same environment, chatglm2-6b-32k has no memory issues at all. Is that because ChatGLM uses some special technique?

In addition, when I tried changing max length to 1500, I got RuntimeError: cu_seqlens_q must have shape (batch_size + 1). Is that expected? I don't understand what this has to do with batch size.

+ python pred.py --model llama2-7b-chat-4k --e
use FlashAttention
Loading checkpoint shards: 100%|██████████| 2/2 [00:48<00:00, 24.45s/it]
Model: llama2-7b-chat-4k, Max Length: 3500

Traceback (most recent call last):
  File "/anonymous/LongBench/pred.py", line 165, in <module>
    preds = get_pred(model, tokenizer, data, max_length, max_gen, prompt_format, dataset, device, model_name)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/LongBench/pred.py", line 78, in get_pred
    output = model.generate(
             ^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/transformers/generation/utils.py", line 1673, in generate
    return self.greedy_search(
           ^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/transformers/generation/utils.py", line 2521, in greedy_search
    outputs = self(
              ^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1034, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/LongBench/llama_flash_attn_monkey_patch.py", line 119, in forward
    x_unpad, indices, cu_q_lens, max_s = unpad_input(x, key_padding_mask)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/flash_attn/bert_padding.py", line 118, in unpad_input
    index_first_axis(rearrange(hidden_states, "b s ... -> (b s) ..."), indices),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anonymous/.env/lib/python3.11/site-packages/flash_attn/bert_padding.py", line 17, in forward
    return torch.gather(
           ^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 140.63 GiB. GPU 0 has a total capacty of 39.56 GiB of which 25.98 GiB is free. Including non-PyTorch memory, this process has 13.57 GiB memory in use. Of the allocated memory 12.99 GiB is allocated by PyTorch, and 91.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

About the models being evaluated

If I want to evaluate a model I have trained myself, should the path simply point to the corresponding checkpoint directory?

Llama2-7B-chat-4k results differ from the paper

I pulled the code and evaluated Llama2-7B-chat-4k, but the results differ from the paper. The only change I made was disabling Flash Attention, and the machine is a V100. My results are somewhat better than those in the paper. Could not using Flash Attention cause this difference?
{ "passage_count": 2.92, "lsht": 18.25, "samsum": 41.25, "lcc": 58.23, "musique": 8.02, "qmsum": 20.84, "narrativeqa": 18.61, "passage_retrieval_zh": 9.12, "trec": 64.0, "2wikimqa": 31.32, "multi_news": 26.34, "triviaqa": 83.51, "multifieldqa_en": 36.91, "dureader": 6.64, "hotpotqa": 27.77, "gov_report": 26.82, "repobench-p": 52.12, "vcsum": 0.17, "multifieldqa_zh": 11.82, "passage_retrieval_en": 7.0, "qasper": 21.69 }

Cannot load LongBench-E

When I run

from datasets import load_dataset

datasets = ["qasper", "multifieldqa_en", "hotpotqa", "2wikimqa", "gov_report", "multi_news", "trec", \
            "triviaqa", "samsum", "passage_count", "passage_retrieval_en", "lcc", "repobench-p"]

for dataset in datasets:
    data = load_dataset('THUDM/LongBench', f"{dataset}_e", split='test')

the error occurs:

ValueError: BuilderConfig qasper_e not found. Available: ['multifieldqa_en', 'lcc', 'passage_retrieval_zh', 'qasper', 'nq', 'passage_retrieval_en', 'gov_report', 'triviaqa', 'qmsum', 'trec', '2wikimqa', 'dureader', 'lsht', 'passage_count', 'repobench-p', 'hotpotqa', 'narrativeqa', 'vcsum', 'musique', 'multifieldqa_zh']

Is LongBench-E available now?

Max length not guaranteed

In pred.py, max length is applied before build_chat. Yet after build_chat, some prompts may exceed max_length, which can cause a performance breakdown if the model's maximum token length equals max_length.
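
One possible mitigation (a sketch, not the repository's code, assuming build_chat(tokenizer, prompt, model_name) as used in pred.py) is to measure the chat-template overhead and shrink the truncation budget accordingly before build_chat is applied:

```python
def template_overhead(tokenizer, build_chat, model_name: str) -> int:
    """Estimate how many tokens build_chat adds around a raw prompt."""
    probe = "x"
    wrapped = build_chat(tokenizer, probe, model_name)
    return len(tokenizer(wrapped).input_ids) - len(tokenizer(probe).input_ids)

# In get_pred, truncate the raw prompt to max_length - template_overhead(...) tokens,
# then call build_chat as before.
```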

How is the data length distribution computed for LongBench-E?

Thanks for sharing your code and paper!

I'm trying to understand the distribution of several datasets in LongBench-E, as the numbers I get are not aligned with the description in task.md and Table 7 in the paper.

from datasets import load_dataset

subsets = ['qasper', 'hotpotqa', '2wikimqa', 'gov_report']

for subset in subsets:
    data = load_dataset('THUDM/LongBench', f"{subset}_e", split='test')    
    context_lengths = [len(row['context']) for row in data]

    print(f'Dataset: {subset}_e')
    print(f'Number of Samples: {len(context_lengths)}')
    print(f'Number of Samples with Context Length < 4000: {len([x for x in context_lengths if x < 4000])}')
    print(f'Number of Samples with Context Length 4000-8000: {len([x for x in context_lengths if x >= 4000 and x < 8000])}')
    print(f'Number of Samples with Context Length 8000+: {len([x for x in context_lengths if x >= 8000])}')

Expected output:
qasper_e should have about 100, 100, and 24 samples in the three intervals, according to the table.

Output:

Dataset: qasper_e
Number of Samples: 224
Number of Samples with Context Length < 4000: 0
Number of Samples with Context Length 4000-8000: 0
Number of Samples with Context Length 8000+: 224
Dataset: hotpotqa_e
Number of Samples: 300
Number of Samples with Context Length < 4000: 2
Number of Samples with Context Length 4000-8000: 2
Number of Samples with Context Length 8000+: 296
Dataset: 2wikimqa_e
Number of Samples: 300
Number of Samples with Context Length < 4000: 0
Number of Samples with Context Length 4000-8000: 2
Number of Samples with Context Length 8000+: 298
Dataset: gov_report_e
Number of Samples: 300
Number of Samples with Context Length < 4000: 0
Number of Samples with Context Length 4000-8000: 1
Number of Samples with Context Length 8000+: 299

After adding the prompt and tokenizing, I would imagine the context lengths would be even longer, which is not aligned with having 100 samples with context length < 4000 as stated in the paper.
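
One likely source of the mismatch, going by the Data Format section above, is that the length field counts words for English data, so bucketing the raw character count of context will overstate lengths. A sketch using the provided field instead:

```python
from collections import Counter
from datasets import load_dataset

data = load_dataset('THUDM/LongBench', 'qasper_e', split='test')
buckets = Counter()
for row in data:
    length = row['length']          # words for English data, characters for Chinese data
    if length < 4000:
        buckets['0-4k'] += 1
    elif length < 8000:
        buckets['4k-8k'] += 1
    else:
        buckets['8k+'] += 1
print(buckets)
```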

KeyError: 'retrieved'

Thanks for the great work! I've encountered an issue while running the pred.py script for long context compression evaluation based on retrieval. The script throws a KeyError: 'retrieved', which is preventing me from making progress. Upon investigation, it appears that the LongBench.py script may not have the necessary code to handle data with the 'retrieved' key.

OOM during inference on very long texts

Hello, I am currently evaluating some models on very long texts with LongBench. When the length is set to 20k, inference with a 7B model on a single 80 GB A100 runs out of memory. How did you deal with this problem when running your evaluations?

add support for other models

The existing code is written for the GLM tokenizer; tokenizers of other decoder-only models do not have the build_prompt() method, so you need to add it yourself beforehand:

def build_prompt(query, history=None):
    if history is None:
        history = []
    prompt = ""
    for i, (old_query, response) in enumerate(history):
        prompt += "[Round {}]\n\n问:{}\n\n答:{}\n\n".format(i + 1, old_query, response)
    prompt += "[Round {}]\n\n问:{}\n\n答:".format(len(history) + 1, query)
    return prompt

Also, when loading other models, you need to use AutoModelForCausalLM at line 43 of pred.py.
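
A minimal sketch of loading such a model with AutoModelForCausalLM (the checkpoint path below is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "meta-llama/Llama-2-7b-chat-hf"   # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16).to("cuda")
model.eval()
```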

Error with multi-gpus

When testing llama2 with device_map='auto' (i.e. multiple GPUs), the flash-attn part raises an error, but there is no problem on another machine, and other models work fine. The error is as follows:
RuntimeError: CUDA error: invalid configuration argument CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

*** Error in `python': free(): invalid pointer: 0x0000557f6c3bedf0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x79a1c)[0x7f72132eaa1c]
/lib64/libc.so.6(+0x7f498)[0x7f72132f0498]
/lib64/libc.so.6(+0x8007c)[0x7f72132f107c]
/usr/local/anaconda3/envs/py39/lib/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so(+0x390e17)[0x7f720bcb3e17]
python(+0x136d16)[0x557f6aa9cd16]
python(+0x2767c8)[0x557f6abdc7c8]
python(Py_FinalizeEx+0x1a4)[0x557f6abdc984]
python(Py_RunMain+0x111)[0x557f6abe3331]
python(Py_BytesMain+0x39)[0x557f6abe3729]
/lib64/libc.so.6(__libc_start_main+0xee)[0x7f7213295bde]
python(+0x203667)[0x557f6ab69667]
======= Memory map: ========
(lengthy /proc memory map dump omitted)

Typo in pred.py

In the get_pred function of pred.py, if "chatglm3" in model: should be changed to if "chatglm3" in model_name:

How to evaluate GPT-3.5 or GPT-4

It seems that only prediction and evaluation for open-source models are provided, so how can the GPT evaluation results be reproduced?
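
The repository does not ship API-model prediction code, but a rough sketch of one way to obtain such predictions is shown below (the model name and prompt construction are illustrative; the outputs would still need to be written in the format that eval.py reads from pred/):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def api_predict(prompt: str, max_gen: int = 128) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=max_gen,
    )
    return resp.choices[0].message.content
```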

TypeError: Couldn't cast array of type list<item: string> to null

The server cannot connect to Hugging Face, so I only replaced the THUDM/LongBench path in pred.py with the local path /home/eval/LongBench/data; the model path in the config file has also been added. The error is as follows:
CUDA_VISIBLE_DEVICES=7 python pred.py --model llama2-13b-chat-16k
Resolving data files: 100%|████████████████████████████████████| 34/34 [00:00<00:00, 149169.81it/s]
Downloading data files: 100%|██████████████████████████████████████| 1/1 [00:00<00:00, 1417.95it/s]
Extracting data files: 100%|█████████████████████████████████████████| 1/1 [00:00<00:00, 87.24it/s]
Generating train split: 2500 examples [00:00, 4816.93 examples/s]
Traceback (most recent call last):
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
writer.write_table(table)
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/arrow_writer.py", line 572, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/table.py", line 2328, in table_cast
return cast_table_to_schema(table, schema)
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/table.py", line 2287, in cast_table_to_schema
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/table.py", line 2287, in
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/table.py", line 1831, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/table.py", line 1831, in
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/table.py", line 2143, in cast_array_to_feature
return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/table.py", line 1833, in wrapper
return func(array, *args, **kwargs)
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/table.py", line 2028, in array_cast
raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{pa_type}")
TypeError: Couldn't cast array of type
list<item: string>
to
null

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/root/zyx/eval/LongBench/pred.py", line 163, in
data = load_dataset('/root/zyx/eval/LongBench/data/data', dataset, split='test')
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/load.py", line 2153, in load_dataset
builder_instance.download_and_prepare(
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/builder.py", line 954, in download_and_prepare
self._download_and_prepare(
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/builder.py", line 1813, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/root/miniconda3/envs/zyx/lib/python3.10/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

How can this be solved?

On the fairness of the truncation length

In the table on the front page, the English average of Llama2-7B-chat-4k is 29.0 and that of ChatGLM2-6B is 26.0, i.e. Llama2-7B-chat-4k outperforms ChatGLM2-6B.

But Llama2-7B-chat-4k is only a 4k model. Looking at pred.py, ChatGLM2-6B goes up to 31.5k; in other words, ChatGLM2-6B really is processing the long context, whereas according to the front page, Llama2-7B-chat-4k has its context truncated from the middle to keep the length within 3.5k. If so, is the comparison between the two fair? Or, more precisely, is the comparison meaningful?

Llama2-7B-chat-4k doing better suggests that the key information for the answers is mostly concentrated at the beginning and end of the context. If all models were truncated from the middle down to 3.5k, the lower-ranked models might well gain a lot, which feels strange. Of course, judging purely by outcomes, there is nothing wrong with the current approach. I am not trying to question anything, just raising this concern, and I do not have a good solution either. My guess is that the current test data favors middle truncation as a preprocessing step, so it cannot assess truly valuable long context techniques.

directly cutting from the middle seems unfair

It isn't a fair comparison to me if the prompt is cut from the middle.

If the answer resides in the middle of the prompt, doing this results in an outright error. If the answer isn't cut away, the task is literally made easier by removing answer-unrelated text. So LongBench actually can't provide a fair comparison of all models' long context ability (models with shorter context lengths are evaluated on easier tasks).

On the reasonableness of the evaluation

The evaluation scores seem to be affected by the model's tendency toward certain answer lengths. For example, when the ground truth is short but the prediction is long, yet contains no factual errors, the computed F1 comes out low. I only did a case-by-case analysis on the 6 Chinese tasks.
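
For reference, a minimal sketch of the word-overlap F1 typically used for such QA-style scoring (not necessarily the repository's exact implementation, which also normalizes text): with a long prediction and a short ground truth, recall stays high but precision drops, so F1 falls.

```python
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("Paris", "Paris"))                                              # 1.0
print(f1_score("The answer to this question is the city of Paris", "Paris"))   # much lower
```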

About a typo in the TREC dataset

In the TREC dataset of the test data, the label "Lasting time of something" is consistently misspelled as "Lasting time of somethin".

Questions on dataset answers.

Hi there, I am a little confused about the answers in the dataset.

In the nq dataset, I can see a sample whose answers contain repeated elements, such as Carey Mulligan and Matthias Schoenaerts. Actually, other samples also have duplicated answers. Is it supposed to be like that?

Thanks for your reply!

