
SnapKV's Introduction

SnapKV 📷

We introduce SnapKV, an innovative, out-of-the-box KV cache compression method.

Requirements

Currently tested with transformers==4.37.0; compatibility with higher versions has not yet been verified.

transformers>=4.36
flash-attn==2.4.0

Installation

git clone git@github.com:FasterDecoding/SnapKV.git
cd SnapKV
pip install -e .

Quick Start

Use SnapKV-optimized Models

For example:

from snapkv.monkeypatch.monkeypatch import replace_mistral
replace_mistral() # Apply the monkey patches to enable SnapKV

Check the example notebook.
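A fuller end-to-end sketch (not from the repo: the checkpoint name, dtype, and generation settings are illustrative assumptions; the patch is applied before the model is loaded, which is the safe order for monkey patching):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snapkv.monkeypatch.monkeypatch import replace_mistral

replace_mistral()  # patch Mistral's attention before the model is instantiated

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # the repo pins flash-attn==2.4.0
    device_map="auto",
)

prompt = "Summarize the following document: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))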

Customize Your SnapKV-optimized Models

SnapKV can be easily integrated with other models.

You can follow the comments marked with [SnapKV] in the existing models to construct your own. (Currently we support the Llama family, Mistral, and Mixtral.)

The detailed algorithm of SnapKV is in snapkv_utils.py; a rough sketch of its selection step follows.
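For orientation, here is a self-contained sketch of that selection step, extrapolated from the update_kv snippet quoted in the issues below. The pooling and aggregation details (kernel_size, summing softmax scores) are assumptions rather than the repo's exact code, and causal masking inside the observation window is omitted for brevity:

import math
import torch
import torch.nn.functional as F

def snapkv_select(key_states, query_states, value_states,
                  window_size=32, max_capacity_prompt=1024, kernel_size=7):
    # All tensors: (bsz, num_heads, q_len, head_dim), called once at prefill.
    bsz, num_heads, q_len, head_dim = query_states.shape
    if q_len < max_capacity_prompt:
        return key_states, value_states  # short prompt: keep everything

    # 1. Score every prefix position using only the last `window_size` queries.
    attn = torch.matmul(query_states[..., -window_size:, :],
                        key_states.transpose(2, 3)) / math.sqrt(head_dim)
    attn = F.softmax(attn, dim=-1, dtype=torch.float32)
    # Sum the observation window's votes for each prefix position.
    scores = attn[..., :-window_size].sum(dim=-2)  # (bsz, heads, q_len - window_size)

    # 2. Max-pool the scores so selected positions bring local context along.
    scores = F.max_pool1d(scores.reshape(bsz * num_heads, 1, -1),
                          kernel_size=kernel_size, padding=kernel_size // 2,
                          stride=1).reshape(bsz, num_heads, -1)

    # 3. Keep the top-(max_capacity_prompt - window_size) prefix positions per head.
    top_k = max_capacity_prompt - window_size
    idx = scores.topk(top_k, dim=-1).indices.sort(dim=-1).values
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
    k_sel = key_states[..., :-window_size, :].gather(2, idx)
    v_sel = value_states[..., :-window_size, :].gather(2, idx)

    # 4. The observation window itself is always kept.
    k_out = torch.cat([k_sel, key_states[..., -window_size:, :]], dim=2)
    v_out = torch.cat([v_sel, value_states[..., -window_size:, :]], dim=2)
    return k_out, v_out

The key design choice: only the last window_size queries vote, so the scoring matmul costs O(window_size · n) memory rather than O(n²).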

Partial Results

[Figure] Comprehensive experiment results on LongBench.

[Figure] Pressure test results on Needle-in-a-Haystack.

TODO

  • Add observation experiments for reproduction.
  • Add LongBench for reproduction.
  • Explore prompt-phase compression.

Citation

If you find this project helpful, please consider citing our report 😊

@article{li2024snapkv,
  title={SnapKV: LLM Knows What You are Looking for Before Generation},
  author={Li, Yuhong and Huang, Yingbing and Yang, Bowen and Venkitesh, Bharat and Locatelli, Acyr and Ye, Hanchen and Cai, Tianle and Lewis, Patrick and Chen, Deming},
  journal={arXiv preprint arXiv:2404.14469},
  year={2024}
}

SnapKV's People

Contributors

leeyeehoo · wendyh1108 · ctlllll


SnapKV's Issues

What prompt was used in the Needle-in-a-Haystack test?

I tried to reproduce the needle test with LWM-Text-Chat-1M, but the model just refuses to answer. I have tried the following prompts in the needle test, and the model just generates </s>:

<s>[INST] <<SYS>>
You are a helpful AI bot that answers questions for a user. Keep your response short and direct
<</SYS>>
{ context }

{retrieval_question} Don't give information outside the document or repeat your findings
[/INST]

and

<s>[INST] <<SYS>>
You are a helpful AI bot that answers questions for a user. Keep your response short and direct
<</SYS>>
{ context }

{retrieval_question} Don't give information outside the document or repeat your findings
[/INST]</s>

Questions on paper and code [prompting for Mistral, positional index, minor errors & questions in paper]

Hello :)
Thank you for the excellent work and for sharing your code. I've learned a lot and have a few questions about the paper and settings:

  • In Figures 2 and 3, what specifically do "prompt" and "context" represent? My guess is that "prompt" refers to the entire input sequence length, and "context" includes specific instructions. Should their labels be switched?

  • Could you share the specific prompt details applied in the Mistral experiment for measuring LongBench performance? Using the default LongBench settings, I observed lower performance overall, particularly in Qasper:

    • For Mistral-v2: Full: 28.92, SnapKV 2048: 26.43, 4096: 28.42 (reported: 33.06/32.47/33.36 respectively).
    • Intuitively, I think that moving the task-specific prompt (e.g., "You are given a scientific article and a question. Answer the question....") from LongBench to the end of the input sequence, so that it falls within the window range, might improve performance. Was there any such modification?
  • Following the SnapKV methodology, I expect the KV cache size to always be bounded by max_capacity_prompt. Yet why does an OOM error occur when a certain length is exceeded (131K in Sec. 5.1)? Could it be due to recalculating the attention weights in Line 9 of Listing 1?

Additionally, there seems to be a minor error in Figure 7 where both the top and bottom plots are labeled as "without Pooling." It might be less confusing to label the bottom plot as "with Pooling."

Thank you for any insights you can provide. I really appreciate the motivation and methodology behind your work!

Grouped query attention implementation

Thank you for your nice work and for sharing the code. Grouped-query attention is used in the Mistral and Mixtral models. However, I found that the implementation in snapkv_utils.py is written for multi-head attention, so it may not be correct for grouped-query attention.
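One possible direction (my assumption, not a confirmed fix): expand the grouped KV heads to match the query heads before scoring, mirroring the repeat_kv helper inside transformers' Llama/Mistral modeling code, then reduce the scores back to one vote per KV head before the top-k so the compressed cache stays grouped:

import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (bsz, num_kv_heads, seq_len, head_dim) -> (bsz, num_kv_heads * n_rep, seq_len, head_dim)
    bsz, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        bsz, num_kv_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(bsz, num_kv_heads * n_rep, slen, head_dim)

# Inside update_kv, score with the expanded keys, then reduce per group, e.g.:
# scores: (bsz, num_heads, L) with num_heads = num_kv_heads * n_rep
# grouped = scores.view(bsz, num_kv_heads, n_rep, -1).sum(dim=2)  # one vote per KV head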

Can't run LongBench!

Here is my env. The version of transformers meets the requirement in monkeypatch.py:

torch==2.2.0
transformers==4.37.0

The traceback is as follows:

>> python pred_snap.py --model llama2-7b-chat-4k --compress_args_path ablation_c1024_w32_k7_maxpool.json

Traceback (most recent call last):
File "experiments/LongBench/pred_snap.py", line 321, in
File "/data1/ss/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "experiments/LongBench/pred_snap.py", line 132, in get_pred_single_gpu
File "/data1/ss/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data1/ss/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 1474, in generate
return self.greedy_search(
File "/data1/ss/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2335, in greedy_search
outputs = self(
File "/data1/ss/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data1/ss/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/ss/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward
outputs = self.model(
File "/data1/ss/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data1/ss/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/ss/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1035, in forward
attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
File "/data1/ss/anaconda3/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 398, in _prepare_4d_causal_attention_mask_for_sdpa
expanded_4d_mask = attn_mask_converter.to_4d(
File "/data1/ss/anaconda3/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 137, in to_4d
expanded_attn_mask = causal_4d_mask.masked_fill(expanded_attn_mask.bool(), torch.finfo(dtype).min)
RuntimeError: The size of tensor a (3509) must match the size of tensor b (7017) at non-singleton dimension 3

I think the reason is that DynamicCache.get_usable_length conflicts with the causal-mask construction in _prepare_4d_causal_attention_mask_for_sdpa.

I would like to know how I can quickly fix this. Thanks :)
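One workaround to try (an assumption based on the traceback, not a confirmed fix): the failure happens on the SDPA mask path, while the repo pins flash-attn==2.4.0, so loading the model with flash attention enabled should bypass _prepare_4d_causal_attention_mask_for_sdpa entirely:

import torch
from transformers import AutoModelForCausalLM

# The checkpoint name here is illustrative; use the one resolved by pred_snap.py.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # skip the SDPA mask preparation
)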

It seems that SnapKV needs to do "prefill" at least once before the prompt can be compressed

SnapKV needs a full-length q, k matmul before its first self-attention, which has $O(n^2)$ space complexity. So does SnapKV need to be able to do "prefill" at least once before the prompt can be compressed?

After that, it can reduce the memory footprint during the decoding phase.

def update_kv(self, key_states, query_states, value_states, attention_mask, num_key_value_groups):
    # check if prefix phase
    assert key_states.shape[-2] == query_states.shape[-2]
    bsz, num_heads, q_len, head_dim = query_states.shape
    if q_len < self.max_capacity_prompt:
        return key_states, value_states
    else:
        # score all keys against only the last `window_size` queries
        attn_weights = torch.matmul(query_states[..., -self.window_size:, :], key_states.transpose(2, 3)) / math.sqrt(head_dim)
        ...
