
infllm's Introduction

InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory

The code for our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory" [pdf].

Updates

  • March 3, 2024: Initial code release. See init.
  • March 24, 2024: Refactored the code to improve inference speed and reduce GPU memory usage.
  • April 4, 2024: Added support for topk retrieval using faiss.
  • April 20, 2024: Added support for LLaMA 3.


Overview


Large language models (LLMs) have emerged as a cornerstone in real-world applications with lengthy streaming inputs, such as LLM-driven agents. However, existing LLMs, pre-trained on sequences with a restricted maximum length, cannot generalize to longer sequences due to out-of-domain and distraction issues. To alleviate these issues, existing efforts employ sliding attention windows and discard distant tokens to process extremely long sequences. Unfortunately, these approaches inevitably fail to capture long-distance dependencies within sequences and thus cannot deeply understand semantics. This paper introduces a training-free, memory-based method, InfLLM, to unveil the intrinsic ability of LLMs to process streaming long sequences. Specifically, InfLLM stores distant contexts in additional memory units and employs an efficient mechanism to look up token-relevant units for attention computation. InfLLM thereby allows LLMs to efficiently process long sequences while maintaining the ability to capture long-distance dependencies. Without any training, InfLLM enables LLMs pre-trained on sequences of a few thousand tokens to outperform competitive baselines that continually train these LLMs on long sequences. Even when the sequence length is scaled to 1,024K tokens, InfLLM still effectively captures long-distance dependencies.
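
To make the mechanism concrete, below is a small, self-contained sketch (illustrative only, not the repository's implementation) of the block-level lookup described above: the current execution block of queries scores each memory unit through a handful of representative keys, retrieves the topk most relevant units, and attends over the attention-sink tokens, the retrieved units, and the local window.

# Illustrative sketch of InfLLM-style memory lookup (single head, no RoPE, no causal
# mask); the representative-key selection here is a simplified stand-in for the
# paper's representative score.
import torch

def retrieve_units(block_q, unit_keys, repr_topk=4, topk=2):
    """Return indices of the `topk` memory units most relevant to the query block.

    block_q:   (block_len, d)          queries of the current execution block
    unit_keys: (n_units, unit_len, d)  keys stored in each memory unit
    """
    scores = torch.einsum("bd,nud->nbu", block_q, unit_keys)     # per-unit key scores
    # For each query, keep its repr_topk best-matching keys inside each unit ...
    repr_scores = scores.topk(repr_topk, dim=-1).values.sum(-1)  # (n_units, block_len)
    # ... and average over the block to get one relevance score per unit.
    relevance = repr_scores.mean(dim=-1)                         # (n_units,)
    return relevance.topk(min(topk, unit_keys.size(0))).indices

def attend(q, k, v):
    w = torch.softmax(q @ k.T / q.size(-1) ** 0.5, dim=-1)
    return w @ v

# Toy usage: 3 memory units of 8 tokens, 2 attention sinks, a 6-token local window,
# and a 4-token execution block of queries.
d = 16
unit_k, unit_v = torch.randn(3, 8, d), torch.randn(3, 8, d)
sink_k, sink_v = torch.randn(2, d), torch.randn(2, d)
local_k, local_v = torch.randn(6, d), torch.randn(6, d)
block_q = torch.randn(4, d)

idx = retrieve_units(block_q, unit_k, repr_topk=4, topk=2)
k = torch.cat([sink_k, unit_k[idx].reshape(-1, d), local_k])
v = torch.cat([sink_v, unit_v[idx].reshape(-1, d), local_v])
out = attend(block_q, k, v)  # (4, d) attention output for the block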

Requirements

torch>=1.13.1
transformers>=4.37.2
fschat>=0.2.35
datasets>=2.17.0
omegaconf
flash-attn

rouge==1.0.1
fuzzywuzzy==0.18.0
jieba==0.42.1

Usage

Configuration

We use YAML files for configuration; the files we use for the benchmarks are in the config/ directory.

The configuration fields are described below:

model: 
  # Attention type.
  # inf-llm / infinite-lm / stream-llm / origin (full attention)
  type: inf-llm 

  # huggingface or model-center model path
  path: mistralai/Mistral-7B-Instruct-v0.2 

  # Whether to use flash attention.
  # For inf-llm/infinite-lm/stream-llm, we implemented multi-stage flash attention with OpenAI's Triton.
  fattn: false 
  
  # RoPE base and distance_scale
  base: 1000000
  distance_scale: 1.0

  # inf-llm/infinite-lm/stream-llm settings

  # Initial tokens kept as attention sinks
  n_init: 128   
  # Local sliding window size
  n_local: 4096 

  # inf-llm settings

  # Number of memory units to retrieve for attention computation.
  topk: 16  
  # The number of top-scoring tokens per memory unit considered as representative elements. 
  repr_topk: 4 
  # Maximum number of memory units stored in GPU memory. 
  max_cached_block: 32
  # Number of tokens queried at a time as an execution block.
  # Each execution block retrieves topk memory units once.
  exc_block_size: 512
  
  # The strategy for replacing cached memory units. 
  # Supported strategies include LRU (Least Recently Used), FIFO (First In, First Out), 
  # and LRU-S (LRU in our paper).
  cache_strategy: lru

  # score_decay for LRU-S
  # score_decay: 0.1

  # Overlap local and global computation.
  # Can accelerate inference, but may not be compatible in all settings.
  async_global_stream: false

  # Use faiss for topk retrieval of memory units.
  # This increases inference time but keeps GPU memory usage constant.
  faiss: false 

  # Use per-head topk retrieval.
  # Very time-consuming; intended for research use only.
  # perhead: false

# Maximum model input length.
# Truncation is applied if the input length exceeds this value.
max_len: 2147483647

# Truncation type. Only suffix is currently supported.
truncation: suffix

# Chunk the input during decoding to save GPU memory (in the FFN block).
chunk_size: 8192

# Conversation type. 
# mistral-inst/vicuna/qwen/minicpm/llama-3-inst
conv_type: mistral-inst
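
Outside the provided scripts, these settings are consumed by inf_llm.utils.patch_hf, which replaces a Hugging Face model's attention with the configured one (the same entry point that appears in the issue snippets further down this page). A minimal loading sketch; the config path is an example, and the dtype/device choices are assumptions:

import yaml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from inf_llm.utils import patch_hf

# Example config path; any file under config/ works the same way.
with open("config/mistral-inf-llm.yaml") as f:
    conf = yaml.safe_load(f)

model_cfg = conf["model"]
tokenizer = AutoTokenizer.from_pretrained(model_cfg["path"])
model = AutoModelForCausalLM.from_pretrained(
    model_cfg["path"],
    torch_dtype=torch.bfloat16,  # dtype/device are assumptions; adjust to your setup
).cuda()

# Swap in the configured attention (inf-llm here).
model = patch_hf(model, model_cfg["type"], **model_cfg)

Generation in the benchmark and chat scripts then goes through the searcher in inf_llm/utils/greedy_search.py rather than the stock generate method.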

Evaluation

Data Preparation We adopt InfiniteBench and LongBench for model evaluation. You can download the datasets by running the following command.

bash scripts/download.sh

Response Generation You can evaluate InfLLM by running the following command. Note that the provided code runs the evaluation on a single GPU; you can accelerate the experiments by using multiple GPUs.

bash scripts/[infinitebench,longbench].sh

Run a Chatbot with InfLLM

We have integrated FastChat's CLI chat.

python -m inf_llm.chat \
    --model-path mistralai/Mistral-7B-Instruct-v0.2 \
    --inf-llm-config-path config/mistral-inf-llm.yaml

Citation

If you find InfLLM useful, please cite the following paper:

@article{xiao2024infllm,
  author       = {Chaojun Xiao and Pengle Zhang and Xu Han and Guangxuan Xiao and Yankai Lin and Zhengyan Zhang and Zhiyuan Liu and Song Han and Maosong Sun},
  title        = {InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding
                  Extremely Long Sequences with Training-Free Memory},
  journal      = {arXiv},
  year         = {2024}
}

infllm's People

Contributors

guyan364, xcjthu


infllm's Issues

Representative Score computation and Memory Lookup implementation details?

Hi, the description in section 3.2 about the Representative Score and Memory Lookup seems to treat the token’s query/key as a single vector. How do you handle query/key for multi-layer and multi-head attention?

CUDA error

cpu_data = data.contiguous().to("cpu", non_blocking=True).pin_memory()

What conditions must be met to add support for a new model?

Hello, I have read the paper and the code carefully. If I want to add a new model so that it works with InfLLM, which conditions must be satisfied? My understanding:

  1. Positional encoding: the new model's attention must also use the RotaryEmbeddingESM encoding; otherwise the positional encoding used during training and the one used by InfLLM at inference would not be equivalent.
  2. The new model's model.model.forward must follow exactly the same logic as InfLLM's model_forward.
  3. The new model's attention forward signature must match the following, so that it stays fully consistent with the arguments of InfLLM's hf_forward:

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        **kwargs,
    )

    That should be enough, right? Are there any other hard requirements? If so, adding a new open-source model should be fairly easy, right? Why do you currently only support LlamaForCausalLM, MistralForCausalLM, and Qwen2ForCausalLM?

Qwen1.5-7B-Chat CUDA error: out of memory

Machine: A800 80G GPU, 360 GB system memory
Config file:

model:
  type: inf-llm
  path: Qwen/Qwen1.5-7B-Chat
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  score_decay: 0.1
  fattn: true
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 512
conv_type: qwen

I modified pred to run inference; the input is roughly 280K tokens (inputs under about 190K tokens do not trigger the error).
Error message:

Traceback (most recent call last):
  File "/root/data/user/XXXX/git/InfLLM/benchmark/common_pred.py", line 325, in <module>
    preds = get_pred(
  File "/root/data/user/XXXX/git/InfLLM/benchmark/common_pred.py", line 271, in get_pred
    output = searcher.generate(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/greedy_search.py", line 32, in generate
    result = self._decode(input_ids, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/greedy_search.py", line 54, in _decode
    out = self.model(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
    outputs = self.model(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/patch.py", line 100, in model_forward
    layer_outputs = decoder_layer(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 773, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/patch.py", line 16, in hf_forward
    ret = forward(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/inf_llm.py", line 58, in forward
    o = past_key_value.append(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 725, in append
    self.append_global(ed - st, kv_ed - kv_st, local_score)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 620, in append_global
    MemoryUnit(self.global_remainder[0][u, :, global_remainder_st:global_remainder_st + self.block_size, :],
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 34, in __init__
    cpu_data = data.contiguous().to("cpu", non_blocking=True).pin_memory()
RuntimeError: CUDA error: out of memory

How can I solve this? The peak GPU memory usage I observe is only 30+ GB, so where is the problem coming from?

OutOfResources: out of resource: shared memory, Required: 151680, Hardware limit: 101376.

Can this be solved by adjusting the configuration parameters? If so, which ones?
I'm using load_in_4bit=True.
config.json

model:
  type: inf-llm
  path: IA-14B-Chat2
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  score_decay: 0.1
  fattn: true
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 8192
conv_type: mistral-inst
Traceback (most recent call last):
  File "/home/luhao/InfLLM-main/inf_llm/chat.py", line 125, in <module>
    chat(config)
  File "/home/luhao/InfLLM-main/inf_llm/chat.py", line 120, in chat
    conv.append(t)
  File "/home/luhao/InfLLM-main/inf_llm/chat.py", line 71, in append
    gen_text = self.searcher.generate(input_ids = new_tokens, max_length=self.max_gen, chunk_size=self.chunk_size, output=True,extra_end_token_ids=[self.tokenizer.bos_token_id,self.tokenizer.pad_token_id,self.tokenizer.eos_token_id], top_k=20, top_p=0.9, temperature=0.95, do_sample=True, repetition_penalty=1.05)[0]
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/greedy_search.py", line 33, in generate
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/greedy_search.py", line 55, in _decode
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/patch.py", line 97, in model_forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 798, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/patch.py", line 16, in hf_forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/inf_llm.py", line 54, in forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 558, in append
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 520, in _append
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 333, in calc_result_and_score
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/mq_attn_triton.py", line 364, in mq_attn_triton
    return _attention.apply(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/mq_attn_triton.py", line 312, in forward
    o, m, l = _forward(q1, k1, v1, mask1, sm_scale, sliding_window=sliding_window1)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/mq_attn_triton.py", line 246, in _forward
    _attn_fwd[grid](
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 232, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 65, in _attn_fwd
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/triton/compiler/compiler.py", line 579, in __getattribute__
    self._init_handles()
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/triton/compiler/compiler.py", line 568, in _init_handles
    raise OutOfResources(self.shared, max_shared, "shared memory")
triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 151680, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.

Implementation of `Stream` and `Infinite`?

Congrats for the nice work.

I see streamingLLM and InfiniteLLM are used in your experiments.

Have you developed your own implementations of stream and infinite? The original StreamingLLM is only designed for the decoding phase and is therefore not suitable for prefilling. Do you have an implementation that supports prefilling?

Thanks!

Code licence

Amazing project! Under what license can code from this repository be used?

ZERO Score when using Origin settings

Hi, this is great work! But I ran into some strange problems:

When I use mistral-origin.yaml and vicuna-origin.yaml, the evaluation scores are 0.

More specifically, the predictions with mistral-origin.yaml are mostly star symbols, like:
"* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * "

and the predictions with vicuna-origin.yaml are repeated "nobody", like:
"nobody nobody nobody nobody nobody nobody nobody nobody nobody nobody nobody "

I have reproduced these results 3 times.

So, has anyone else met this problem?

transformers==4.37.2
torch==2.1.0

Using A100-80G

OOM issue

Great work.
When I try to run a 13B model with more than 100K context length on the passkey retrieval task, it throws an OOM error.
Will the code itself support multi-GPU inference?

ValueError: Only supports llama, mistral and qwen2 models.

import yaml
from inf_llm.utils import patch_hf
from transformers import AutoModel

def load_yaml_config(file_path='path_to_your_config_file.yaml'):
    """ Load a YAML configuration file. """
    with open(file_path, 'r') as file:
        return yaml.safe_load(file)


# Load the configuration for infinite context
config_path = 'minicpm-inf-llm.yaml'
with open(config_path, 'r') as file:
    inf_llm_config = yaml.safe_load(file)
inf_llm_config

from inf_llm.utils import patch_hf
config = load_yaml_config(file_path=config_path)['model']
# `model` is assumed to have been loaded earlier (not shown in the snippet)
model = patch_hf(model, config['type'], **config)

produces

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[26], line 3
      1 from inf_llm.utils import patch_hf
      2 config = load_yaml_config(file_path=config_path)['model']
----> 3 model = patch_hf(model, config['type'], **config)

File /home/user/mamba/InfLLM/inf_llm/utils/patch.py:150, in patch_hf(model, attn_type, attn_kwargs, base, distance_scale, **kwargs)
    148     Model = model.model.__class__
    149 else:
--> 150     raise ValueError("Only supports llama, mistral and qwen2 models.")
    152 hf_rope = model.model.layers[0].self_attn.rotary_emb 
    153 base = base if base is not None else hf_rope.base

ValueError: Only supports llama, mistral and qwen2 models.

Multi-GPU support

Great work. We verified that Qwen1.5-7B-Chat also performs very well on the ∞-Bench dataset. In earlier issues you mentioned that there is currently no plan to support multiple GPUs. What are the main difficulties in adding multi-GPU support to InfLLM? We would like to try implementing it; could you share some guidance?

Inference time

With the qwen-14b-chat model on the MultiFieldQA-zh dataset, running on a single A100, inference time increases by roughly 2x.

  1. Original qwen-14b-chat
    [screenshot]
  2. qwen-14b-chat + InfLLM
    [screenshot]

`Position Emb` and `Chunk size`

Great job, I found two problems when trying to reproduce the paper's results.

  1. The paper explains that the same positional embedding is used for all context memory units. But in the code implementation, there seems to be no positional embedding applied to the cached Ks at all?

  2. Why is a chunk size needed? The proposed method performs attention block by block, which (I think) should not cause OOM errors even without the chunking trick in decoding. But I found it fails to process 100K text without setting chunk_size, while using flash attention is totally fine in the same circumstances.

How to use w transformers?

I use transformers with a custom script. I see you show how to use this with a custom FastChat script.

Do you have boilerplate code showing how to wrap a transformers pipeline to use it with this?
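
For reference, a rough sketch of such a wrapper, pieced together from the patch_hf usage in the Configuration section above and the searcher call visible in the tracebacks on this page; the GreedySearch class name and its constructor arguments are assumptions, so check inf_llm/utils/greedy_search.py before relying on it:

import yaml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from inf_llm.utils import patch_hf
from inf_llm.utils.greedy_search import GreedySearch  # class name assumed

with open("config/mistral-inf-llm.yaml") as f:
    conf = yaml.safe_load(f)["model"]

tokenizer = AutoTokenizer.from_pretrained(conf["path"])
model = AutoModelForCausalLM.from_pretrained(conf["path"], torch_dtype=torch.bfloat16).cuda()
model = patch_hf(model, conf["type"], **conf)   # swap in InfLLM attention
searcher = GreedySearch(model, tokenizer)       # constructor arguments assumed

prompt = "Summarize the following document: ..."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
# Same keyword arguments as the searcher.generate call shown in the chat.py traceback above.
output = searcher.generate(
    input_ids=input_ids,
    max_length=512,
    chunk_size=8192,
    output=True,
    extra_end_token_ids=[tokenizer.eos_token_id],
)
print(output[0])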

Accuracy

Running bash infinitebench.sh and testing 50 samples from the longbook_choice_eng dataset, the accuracy is only about 40%. Is something wrong with this result, or did something go wrong in the pipeline?

IndexErrors when attempting to run Triton flash attention

Hi, I am attempting to run mistral-inf-llm-fattn on a single V100, but I am getting IndexErrors. Do you have any idea what the problem might be? Below is the full output:

Exception has occurred: IndexError
map::at
  File "/home/aoomerjee/EM-LLM/inf_llm/attention/dot_production_attention/triton_impl.py", line 430, in _forward
    _attn_fwd[grid](
  File "/home/aoomerjee/EM-LLM/inf_llm/attention/dot_production_attention/triton_impl.py", line 534, in append
    o, m, l = _forward(
              ^^^^^^^^^
  File "/home/aoomerjee/EM-LLM/inf_llm/attention/inf_llm_context_manager.py", line 555, in _retrieve_and_attend
    attn.append(
  File "/home/aoomerjee/EM-LLM/inf_llm/attention/inf_llm_context_manager.py", line 779, in append
    exc_block_attn_output, exc_repr_score = self._retrieve_and_attend(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aoomerjee/EM-LLM/inf_llm/attention/inf_llm.py", line 64, in forward
    o = past_key_value.append(
        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aoomerjee/EM-LLM/inf_llm/utils/patch_hf.py", line 19, in hf_forward
    ret = forward(
          ^^^^^^^^
  File "/home/aoomerjee/EM-LLM/inf_llm/utils/patch_hf.py", line 87, in model_forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/home/aoomerjee/EM-LLM/inf_llm/utils/greedy_search.py", line 46, in _model_pass
    out = self.model(
          ^^^^^^^^^^^
  File "/home/aoomerjee/EM-LLM/inf_llm/utils/greedy_search.py", line 74, in _decode
    out = self._model_pass(
          ^^^^^^^^^^^^^^^^^
  File "/home/aoomerjee/EM-LLM/inf_llm/utils/greedy_search.py", line 32, in generate
    result = self._decode(input_ids, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aoomerjee/EM-LLM/benchmark/pred.py", line 259, in get_pred
    output = searcher.generate(
             ^^^^^^^^^^^^^^^^^^
  File "/home/aoomerjee/EM-LLM/benchmark/pred.py", line 324, in <module>
    preds = get_pred(
            ^^^^^^^^^
IndexError: map::at

Running with infllm-12k.yaml raises errors

Traceback (most recent call last):
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/pred.py", line 327, in
preds = get_pred(
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/pred.py", line 260, in get_pred
output = searcher.generate(
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/inf_llm/utils/greedy_search.py", line 32, in generate
result = self._decode(input_ids, **kwargs)
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/inf_llm/utils/greedy_search.py", line 54, in _decode
out = self.model(
File "/home/llm/miniconda3/envs/llm_infer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/llm/miniconda3/envs/llm_infer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/llm/miniconda3/envs/llm_infer/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1162, in forward
outputs = self.model(
File "/home/llm/miniconda3/envs/llm_infer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/llm/miniconda3/envs/llm_infer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/inf_llm/utils/patch.py", line 102, in model_forward
layer_outputs = decoder_layer(
File "/home/llm/miniconda3/envs/llm_infer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/llm/miniconda3/envs/llm_infer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/llm/miniconda3/envs/llm_infer/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 757, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/llm/miniconda3/envs/llm_infer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/llm/miniconda3/envs/llm_infer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/inf_llm/utils/patch.py", line 16, in hf_forward
ret = forward(
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/inf_llm/attention/inf_llm.py", line 64, in forward
o = past_key_value.append(
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/inf_llm/attention/context_manager.py", line 774, in append
chunk_o, local_score = self._append(
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/inf_llm/attention/context_manager.py", line 520, in _append
global_h_k, global_h_v, global_sliding_window, global_block_map, global_block_num = self.get_global_hidden_and_mask(local_h_q.size(-2), block_topk)
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/inf_llm/attention/context_manager.py", line 419, in get_global_hidden_and_mask
self.global_blocks[u][b_idx].load((global_h_k[u, :, st:ed, :], global_h_v[u, :, st:ed, :]))
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/inf_llm/attention/context_manager.py", line 76, in load
gpu_data, gpu_data_id = self.cache.alloc()
File "/dataset-vlm/jingyaoli/LLMInfer/InfLLM/benchmark/inf_llm/attention/context_manager.py", line 19, in alloc
assert len(self.idle_set) > 0
AssertionError

Question about the LongBench results

[screenshot of results]

Running longbench.sh with the original configuration gives the results shown above, which differ considerably from those in the paper. Is this expected?

config=config/mistral-inf-llm.yaml

datasets="narrativeqa,qasper,multifieldqa_en,
hotpotqa,2wikimqa,musique,
gov_report,qmsum,multi_news,
trec,triviaqa,samsum,
passage_count,passage_retrieval_en,
lcc,repobench-p"

mkdir benchmark/longbench-result

python3 benchmark/pred.py \
--config_path ${config} \
--output_dir_path benchmark/longbench-result \
--datasets ${datasets}

python3 benchmark/eval.py --dir_path benchmark/longbench-result

longbench

Hi, the LongBench metrics I obtained differ from those in the paper.
Is there any wrong setting in my experiment?
Thanks.

model:
  type: inf-llm
  path: Mistral-7B-Instruct-v0.2
  fattn: false
  block_size: 128
  base: 1000000
  distance_scale: 1.0
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  async_global_stream: true
  cache_strategy: lru
  faiss: false

max_len: 2147483647
truncation: suffix
chunk_size: 8192

conv_type: mistral-inst

Questions about the code implementation

Hello, while reading the code I have some questions:
1. Why is the local positional encoding also applied to global_q?
2. What is the rough principle behind using global_k and global_v to compute the global score?

Error on long text

Calling chat.py and writing content into it produces an error:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 4093-4094: unexpected end of data

GPU memory usage

The qwen1.5-14b model runs out of memory on a single 80 GB A100. Does it really need that much GPU memory?

Poor results in practice

Hello, your work is really creative. I removed the original evaluation code and tested with Qwen1.5-14B-Chat. Once the provided text reaches about 30,000 characters in length, InfLLM can no longer answer questions accurately, and the errors get worse as the text grows. Could you help me figure out which parameters to adjust to improve the results?

The YAML file is as follows:
model:
  type: inf-llm
  path: /data/public/LLM/basemodels/qwen_1_5/Qwen1.5-14B-Chat/
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  fattn: True
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 2048
conv_type: qwen

Server script:

lnFLLM_server.txt

Qwen1.5-72B-chat-AWQ with longbench and infinibench benchmark OOM with A100 80G

When I test Qwen1.5-72B-Chat-AWQ with bash scripts/longbench.sh, it runs out of memory on an A100 80G.

My config:

model:
  type: inf-llm
  path: /root/czh/quant_models/Qwen2-geogpt-72b-0412-awq-dde-12000
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  fattn: false
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 2048
conv_type: qwen

The Traceback is as follows:
Traceback (most recent call last):
File "/root/czh/InfLLM/benchmark/pred.py", line 321, in
preds = get_pred(
File "/root/czh/InfLLM/benchmark/pred.py", line 256, in get_pred
output = searcher.generate(
File "/root/czh/InfLLM/inf_llm/utils/greedy_search.py", line 32, in generate
result = self._decode(input_ids, **kwargs)
File "/root/czh/InfLLM/inf_llm/utils/greedy_search.py", line 54, in _decode
out = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1169, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/czh/InfLLM/inf_llm/utils/patch.py", line 100, in model_forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 768, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/czh/InfLLM/inf_llm/utils/patch.py", line 16, in hf_forward
ret = forward(
File "/root/czh/InfLLM/inf_llm/attention/inf_llm.py", line 64, in forward
o = past_key_value.append(
File "/root/czh/InfLLM/inf_llm/attention/context_manager.py", line 774, in append
chunk_o, local_score = self._append(
File "/root/czh/InfLLM/inf_llm/attention/context_manager.py", line 526, in _append
attn.append(
File "/root/czh/InfLLM/inf_llm/attention/dot_production_attention/torch_impl.py", line 96, in append
self.finalize()
File "/root/czh/InfLLM/inf_llm/attention/dot_production_attention/torch_impl.py", line 22, in finalize
tmp = torch.masked_fill(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 190.19 MiB is free. Process 3985934 has 78.95 GiB memory in use. Of the allocated memory 75.61 GiB is allocated by PyTorch, and 2.82 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
/usr/local/lib/python3.10/dist-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Evaluating on: ['result.json']
{}
Can someone help with this issue? Thanks!

Can it be integrated with vLLM?

Hello, I see that models are loaded with the AutoModelForCausalLM API. Is it possible to load models with vLLM to speed up inference? As far as I know, vLLM also manages the keys and values in memory via PagedAttention; does that conflict with how InfLLM works?

Request for parameter comment

Hi,

Great work! Would it be possible to add more comments or explanations to the parameter configurations in the model yaml file? Thank you.

GPU memory usage at benchmark

I want to know the expected GPU memory usage, because I run out of memory when testing the benchmark.
Model: Qwen1.5-0.5B-Chat
GPU: RTX 3090
Commands: bash scripts/infinitebench.sh
bash scripts/longbench.sh
Result: CUDA out of memory.

Qwen1.5-7B-Chat

Are there benchmark results for the Qwen-series models?

PS: With the default YAML config, running the qasper dataset immediately runs out of GPU memory (NVIDIA A100-SXM4-80GB).

model:
  type: inf-llm
  path: /data/model/open_source_data/Qwen/Qwen1.5-7B-Chat
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  score_decay: 0.1
  fattn: true
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 4096  # 8192 also runs out of memory
conv_type: qwen
