spqr's Introduction

SpQR model compression

Note: This repository contains the quantization algorithm and the model evaluation code for the SpQR method for LLM compression; the efficient inference code will be added soon.

It accompanies the research paper "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression".

Installation

Packages

To run SpQR with Falcon models, make sure that you have torch>=2.0.0 with CUDA support.

Install packages from requirements.txt:

pip install -r requirements.txt

Note: the results reported in the arXiv paper were obtained using the 4.28.dev0 version of transformers, commit id 464d420775.

Loading / caching datasets and tokenizer

The script requires downloading and caching the relevant tokenizer and datasets locally. They will be saved in the default Hugging Face Datasets directory unless an alternative location is provided via environment variables. See the relevant section of the Datasets documentation.
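If the default location is unsuitable (e.g. a small home partition), the standard Hugging Face cache variables can be set before anything is loaded. A minimal sketch, assuming the target path below is replaced with a real directory:

import os

# Redirect the Hugging Face cache before importing datasets/transformers.
# HF_HOME and HF_DATASETS_CACHE are standard Hugging Face environment variables;
# the path here is only a placeholder.
os.environ["HF_HOME"] = "/mnt/big_disk/hf_cache"
os.environ["HF_DATASETS_CACHE"] = "/mnt/big_disk/hf_cache/datasets"

from datasets import load_dataset
from transformers import AutoTokenizer

# Both calls below will now download into and read from the custom cache location.
tokenizer = AutoTokenizer.from_pretrained("<PATH_TO_MODEL_DIR>")
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")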

Models

So far, this repository is expected to work with models of the LLaMA, Falcon and OPT families.

Data

For quantization with SpQR, it is recommended to use a subset of the data the model was trained on: for LLaMA models we recommend a subset of RedPajama, and for Falcon models a subset of RefinedWeb. Both subsets are stored in the data directory:

  • data/red_pajama_n=1024.pth
  • data/refined_web_n=128.pth

Note: these subsets are already processed with the corresponding model's tokenizer. Using them with a different model will lead to unexpected behavior.

For OPT, following the GPTQ paper, we recommend using C4.
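If you need a calibration file for a different model, the sketch below shows how one might be prepared. It assumes the files above are torch-saved lists of [1, seqlen] token-id tensors; verify the exact expected format against the data-loading code in this repository before relying on it.

import torch
from transformers import AutoTokenizer

# Hypothetical sketch: tokenize calibration documents with *your* model's tokenizer
# and save them as a list of [1, seqlen] LongTensors (assumed format, see note above).
tokenizer = AutoTokenizer.from_pretrained("<PATH_TO_MODEL_DIR>")
raw_texts = ["first long calibration document ...", "second long calibration document ..."]
seqlen = 2048

samples = []
for text in raw_texts:
    ids = tokenizer(text, return_tensors="pt").input_ids
    if ids.shape[1] >= seqlen:
        samples.append(ids[:, :seqlen])

torch.save(samples, "data/my_calibration_data.pth")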

W&B logging

For convenience, one can optionally log the data to the Weights & Biases service (wandb). Run pip install wandb to enable W&B logging. Specify the $WANDB_ENTITY, $WANDB_PROJECT and $WANDB_NAME environment variables prior to running experiments, and use the --wandb argument to enable logging.
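For example, the same setup can be scripted from Python before launching the quantization run (the entity, project and run names below are placeholders):

import os
import subprocess

# Placeholder W&B settings; --wandb tells main.py to log the run to Weights & Biases.
os.environ["WANDB_ENTITY"] = "my-team"
os.environ["WANDB_PROJECT"] = "spqr-experiments"
os.environ["WANDB_NAME"] = "llama-7b-w4-g16"

subprocess.run(
    ["python", "main.py", "<PATH_TO_MODEL_DIR>", "pajama", "--wbits", "4", "--wandb"],
    check=True,
)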

Launching

GPU and RAM requirements

This code was developed and tested on a single A100 GPU with 80GB of VRAM. It may successfully run on GPUs with 32GB+ of VRAM for perplexity evaluation of models up to LLaMA-65B and Falcon-40B. With the --offload_activations option, model perplexity may be evaluated on machines with less VRAM: 24GB+ for LLaMA-65B and 6GB+ for LLaMA-7B. The perplexity testing code also requires enough RAM to hold the uncompressed model weights (e.g. ~130GB for LLaMA-65B) and the testing datasets. For Language Model Evaluation Harness evaluation, one needs enough memory to load the whole model onto one or several devices, plus the activation tensors.

Model downloading

The code requires the LLaMA model to be downloaded in Hugging Face format and saved locally. The scripts below assume that the $TRANSFORMERS_CACHE variable points to the Hugging Face Transformers cache folder.
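For illustration, a checkpoint can be fetched in Hugging Face format and saved to a local folder as follows; huggyllama/llama-7b is only one example of a publicly mirrored LLaMA checkpoint, substitute whichever model you are licensed to use.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Example only: download a model in Hugging Face format and save it locally, so that
# $MODEL_PATH can point at a folder containing config.json and the weight shards.
name, local_dir = "huggyllama/llama-7b", "./llama-7b-hf"
AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto").save_pretrained(local_dir)
AutoTokenizer.from_pretrained(name).save_pretrained(local_dir)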

Perplexity benchmarks:

This script compresses the model and then tests its performance in terms of perplexity using WikiText2, C4, and Penn Treebank datasets.
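For reference, perplexity on these datasets is usually computed by slicing the tokenized test set into fixed-length windows and exponentiating the average cross-entropy. The sketch below is only illustrative; the repository's own evaluation lives in main.py.

import torch

@torch.no_grad()
def perplexity(model, input_ids, seqlen=2048):
    # input_ids: a [1, n_tokens] tensor holding the whole tokenized test set.
    nlls = []
    for start in range(0, input_ids.shape[1] - seqlen, seqlen):
        window = input_ids[:, start : start + seqlen]
        loss = model(window, labels=window).loss  # mean next-token cross-entropy
        nlls.append(loss * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))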

The command to launch the script should look like this:

export MODEL_PATH=<PATH_TO_MODEL_DIR>
export DATASET=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>

python main.py $MODEL_PATH $DATASET \
    --wbits 4 \
    --groupsize 16 \
    --perchannel \
    --qq_scale_bits 3 \
    --qq_zero_bits 3 \
    --qq_groupsize 16 \
    --outlier_threshold=0.2 \
    --permutation_order act_order \
    --percdamp 1e0 \
    --nsamples 128 

The command above runs near-lossless compression as described in the article. Adjusting the above parameters allows for tighter compression with a slightly greater loss.

Note the launch arguments:

  • <PATH_TO_MODEL_DIR> - path to model folder, which contains config.json
  • one of [c4, ptb, wikitext2, pajama, refinedweb, none] -- name of dataset to use for compression, or path to an alternative preprocessed and tokenized dataset.
  • --wbits 4 -- number of bits for the quantized weight representation
  • --groupsize 16 -- size of first-order groups for compression
  • --qq_groupsize 16 -- size of second-order (quantized) groups for compression
  • --qq_scale_bits 3 --qq_zero_bits 3 -- bit sizes for quantizing the first-order groups' scales and zero points (see the toy sketch after this list)
  • --offload_activations -- moves activations to RAM when they are not in use; reduces VRAM usage while slowing the run by ~10%.
  • --save / --load -- path to save/load the quantized model.

Run python main.py --help for more details on command-line arguments, including compression parameters.
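To make the first- and second-order parameters concrete, here is a toy sketch of two-level group quantization. It mirrors the idea behind --groupsize and the --qq_* options but is not the repository's actual implementation.

import torch

def quantize_groupwise(x, bits, groupsize):
    # Uniform asymmetric quantization of a 1-D tensor in groups of `groupsize`.
    qmax = 2 ** bits - 1
    x = x.reshape(-1, groupsize)
    xmin = x.min(dim=1, keepdim=True).values
    xmax = x.max(dim=1, keepdim=True).values
    scale = (xmax - xmin).clamp(min=1e-9) / qmax
    zero = (-xmin / scale).round()
    q = (x / scale + zero).round().clamp(0, qmax)
    return q, scale.squeeze(1), zero.squeeze(1)

# First order: the weights are quantized in small groups (--wbits 4, --groupsize 16).
weights = torch.randn(4096)
q, scales, zeros = quantize_groupwise(weights, bits=4, groupsize=16)

# Second order: the per-group scales and zero points are themselves quantized in groups
# (--qq_scale_bits 3, --qq_zero_bits 3, --qq_groupsize 16), which keeps small groups cheap.
q_scales, *_ = quantize_groupwise(scales, bits=3, groupsize=16)
q_zeros, *_ = quantize_groupwise(zeros, bits=3, groupsize=16)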

LM Evaluation Harness benchmark.

To perform zero-shot evaluation, we use the Language Model Evaluation Harness framework with slight modifications. This repository contains a copy of the LM Evaluation Harness repo from early 2023 in the lm-eval-harness folder.

Installation

Before running the code, make sure that you have all the requirements and dependencies of lm-eval-harness installed. To install them, run:

pip install -r lm-evaluation-harness/requirements.txt

Execution

The main script for launching the evaluation procedure is lmeval.py.

Note: the current version of the script supports only LLaMA/Falcon quantization. Therefore, set:

  • --model=hf-causal
  • --model_args pretrained=$MODEL_PATH, where $MODEL_PATH points to one of the LLaMA models

--quantization_args -- a list of comma-separated arguments for the quantizer. For details and options, refer to spqr_config.py.

An example benchmark launch is shown below.

export MODEL_PATH=<INSERT PATH_TO_MODEL_DIR>
export DATASET=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>

python lmeval.py \
    --model hf-causal \
    --model_args pretrained=$MODEL_PATH,dtype=float16,use_accelerate=True \
    --quantization_args dataset=$DATASET,wbits=4,groupsize=16,perchannel=True,qq_scale_bits=3,qq_zero_bits=3,qq_groupsize=16,percdamp=1.0,outlier_threshold=0.2,simplified_outliers=False,nsamples=128,offload_activations=True \
    --tasks winogrande,piqa,hellaswag,arc_easy,arc_challenge \
    --batch_size 1

Performance and runtime notes:

  • For large models (LLaMA-30B, LLaMA-65B) specify max_memory_per_gpu={value}GIB so that 15-20GiB of GPU memory remains free on each GPU to store activations for calibration.
  • offload_activations=True slightly reduces peak memory consumption.
  • Typically, LLaMA-30B requires 1-2 A100 GPUs with 80GB of memory each, and LLaMA-65B requires 3 A100s with 80GB each.
  • With enough spare GPU memory, one can raise the batch size to accelerate the evaluation.

Citation

@misc{dettmers2023spqr,
      title={SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression}, 
      author={Tim Dettmers and Ruslan Svirschevski and Vage Egiazarian and Denis Kuznedelev and Elias Frantar and Saleh Ashkboos and Alexander Borzunov and Torsten Hoefler and Dan Alistarh},
      year={2023},
      eprint={2306.03078},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

spqr's People

Contributors

poedator, godofnothing, vahe1994, justheuristic, erjanmx, ewouth, eltociear


spqr's Issues

LLaMa 30B loading error

Hi, I'm trying to test this on the LLaMA-30B model; however, I get the following error:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│                                                                              │
│ /home/user/SpQR/main.py:577 in <module>                                  │
│                                                                              │
│   574 │   device = "cuda" if torch.cuda.is_available() else "cpu"            │
│   575 │                                                                      │
│   576 │   print("============  Loading model... ============")               │
│ ❱ 577 │   model = get_model(args.model_path, args.load, args.dtype).train(Fa │
│   578 │                                                                      │
│   579 │   print("\n============ Quantizing model... ============")           │
│   580 │   if args.wbits < 16 and args.load:                                  │
│ /home/user/SpQR/modelutils.py:45 in get_model                            │
│                                                                              │
│    42 │   │   │   model = load_quantized_model(model, load_quantized)        │
│    43 │   │   else:                                                          │
│    44 │   │   │   print("Loading pretrained model ...")                      │
│ ❱  45 │   │   │   model = AutoModelForCausalLM.from_pretrained(              │
│    46 │   │   │   │   pretrained_model_name_or_path=model_path,              │
│    47 │   │   │   │   trust_remote_code=True,                                │
│    48 │   │   │   │   torch_dtype=dtype,                                     │
│                                                                              │
│ /home/user/.local/lib/python3.9/site-packages/transformers/models/auto/a │
│ uto_factory.py:467 in from_pretrained                                        │
│                                                                              │
│   464 │   │   │   )                                                          │
│   465 │   │   elif type(config) in cls._model_mapping.keys():                │
│   466 │   │   │   model_class = _get_model_class(config, cls._model_mapping) │
│ ❱ 467 │   │   │   return model_class.from_pretrained(                        │
│   468 │   │   │   │   pretrained_model_name_or_path, *model_args, config=con │
│   469 │   │   │   )                                                          │
│   470 │   │   raise ValueError(                                              │
│                                                                              │
│ /home/user/.local/lib/python3.9/site-packages/transformers/modeling_util │
│ s.py:2777 in from_pretrained                                                 │
│                                                                              │
│   2774 │   │   │   │   mismatched_keys,                                      │
│   2775 │   │   │   │   offload_index,                                        │
│   2776 │   │   │   │   error_msgs,                                           │
│ ❱ 2777 │   │   │   ) = cls._load_pretrained_model(                           │
│   2778 │   │   │   │   model,                                                │
│   2779 │   │   │   │   state_dict,                                           │
│   2780 │   │   │   │   loaded_state_dict_keys,  # XXX: rename?               │
│                                                                              │
│ /home/user/.local/lib/python3.9/site-packages/transformers/modeling_util │
│ s.py:3104 in _load_pretrained_model                                          │
│                                                                              │
│   3101 │   │   │   │   # Skip the load for shards that only contain disk-off │
│   3102 │   │   │   │   if shard_file in disk_only_shard_files:               │
│   3103 │   │   │   │   │   continue                                          │
│ ❱ 3104 │   │   │   │   state_dict = load_state_dict(shard_file)              │
│   3105 │   │   │   │                                                         │
│   3106 │   │   │   │   # Mistmatched keys contains tuples key/shape1/shape2  │
│   3107 │   │   │   │   # matching the weights in the model.                  │
│                                                                              │
│ /home/user/.local/lib/python3.9/site-packages/transformers/modeling_util │
│ s.py:444 in load_state_dict                                                  │
│                                                                              │
│    441 │   │   │   raise NotImplementedError(                                │
│    442 │   │   │   │   f"Conversion from a {metadata['format']} safetensors  │
│    443 │   │   │   )                                                         │
│ ❱  444 │   │   return safe_load_file(checkpoint_file)                        │
│    445 │   try:                                                              │
│    446 │   │   return torch.load(checkpoint_file, map_location="cpu")        │
│    447 │   except Exception as e:                                            │
│                                                                              │
│ /home/user/.local/lib/python3.9/site-packages/safetensors/torch.py:101   │
│ in load_file                                                                 │
│                                                                              │
│    98 │   result = {}                                                        │
│    99 │   with safe_open(filename, framework="pt", device=device) as f:      │
│   100 │   │   for k in f.keys():                                             │
│ ❱ 101 │   │   │   result[k] = f.get_tensor(k)                                │
│   102 │   return result                                                      │
│   103                                                                        │
│   104                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: shape '[6656, 6656]' is invalid for input of size 33697155

I am running the same command as in the README:

python3 main.py $MODEL_PATH $DATASET \
    --wbits 4 \
    --groupsize 16 \
    --perchannel \
    --qq_scale_bits 3 \
    --qq_zero_bits 3 \
    --qq_groupsize 16 \
    --outlier_threshold=0.2 \
    --permutation_order act_order \
    --percdamp 1e0 \
    --offload_activations \
    --nsamples 4

Any ideas how to fix this?

NoneType error occurred while quantizing the LLaMA-2 model

Hi,
Thanks for your wonderful work. I'm trying to use SpQR (./run.sh) to quantize a LLaMA-2-7B model, but it always shows a NoneType error:
def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
    gather_indices = position_ids[:, None, :, None]  # [bs, 1, seq_len, 1]
    gather_indices = gather_indices.repeat(1, cos.shape[1], 1, cos.shape[3])
    cos = torch.gather(cos.repeat(gather_indices.shape[0], 1, 1, 1), 2, gather_indices)
    sin = torch.gather(sin.repeat(gather_indices.shape[0], 1, 1, 1), 2, gather_indices)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

gather_indices = position_ids[:, None, :, None] # [bs, 1, seq_len, 1]
TypeError: 'NoneType' object is not subscriptable.

I was wondering how I can solve this problem. Any help would be greatly appreciated.

Best regards,
Lucas

Why no save?

Hi, thanks for sharing this great quantization technique.

But I am not sure I understand why saving is not supported at the moment; in main.py:

    if args.save or args.save_safetensors:
        raise NotImplementedError()

What are the obstacles that prevent compressed models from being saved and reloaded?

Do you know of any techniques that allow dumping the model from VRAM to disk and reloading it from disk directly?

Evaluation code for Falcon models

Hi Authors,

Thank you for sharing the evaluation code for llama models. Could you please release the code for the evaluation of Falcon models?

Best regards,
Abdelrahman.

Outlier mask is still permuted when returned

unstructured_outlier_mask is not reversely permuted when returned from SPQRUtil.quantize(). Consequently, when comparing the outlier mask and the quantised weights, the positions of the outliers do not correspond.

Is there any reason for this? From what I saw, it doesn't affect the functionality of the program whatsoever, but it caused me some headaches when studying and debugging the code.
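For anyone hitting the same confusion: mapping the returned mask back to the original column order only requires the inverse permutation. A self-contained toy example, where perm stands in for whatever permutation was applied internally:

import torch

# Undo a column permutation with its inverse (the argsort of the permutation).
perm = torch.randperm(8)                    # stand-in for the internal permutation
invperm = torch.argsort(perm)
mask_permuted = torch.rand(4, 8) > 0.9      # stand-in for unstructured_outlier_mask
mask_original = mask_permuted[:, invperm]   # columns back in their original order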

Reason for permutation and weights after it is inverted

  1. Why is weight permutation used in the code and is it mentioned in the paper?
  2. I looked at the layer weights after quantization, and they are supposed to have a certain pattern: torch.unique(layer.weight.data[idx,:blocksize]) for any idx should output no more than 2^bits values for any quantization. It works for the original GPTQ code, and it also works if I use the identity permutation in your code, but it doesn't work for other permutation options, which consistently give blocksize distinct values instead of 2^bits. Am I missing something? Outliers can potentially contribute to the total number of unique values, but there cannot be blocksize - 2^bits of them (why quantize at all then?). Are you sure that the weight matrix is reconstructed correctly?

Which dataset should I use?

Hello, I have a question. I currently have a LLaMA-family model that has been fine-tuned on my own dataset. If I want to quantize it with SpQR, do I also use data/red_pajama_n=1024.pth as the calibration data, or do I use the dataset that I used for fine-tuning?
Looking forward to getting your response!

model downloading

Hi!

Thanks for sharing this awesome work!

One question: where can I download LLaMA in Hugging Face format so that I can run the commands?

Thanks!

Doesn't seem to work for Baichuan-7B

Hi, @Vahe1994.
It is so kind of you to release such great work!
I applied SpQR to Baichuan-7B (whose network structure is the same as LLaMA-7B, except that the vocabulary of the embedding & lm_head layers is twice as large) and found that outlier_threshold has to be tuned quite high (i.e., 3.0) to reach the fraction of outliers (nearly 1%) recommended by your paper. However, after tuning it, the average score on the C-Eval validation set dropped drastically, from 38.5 to 23.0 (following the official evaluation script, zero-shot).
I would really appreciate it if you could help us understand this issue.
Thank you so much!

CUDA out of memory falcon-40b when using 40Gi A100 GPU

I've been trying to run quantization for Falcon-40B on a box with 8 40GiB A100s, but I keep getting CUDA memory errors. The README states that this should be possible, unless I'm misreading this line:

It may successfully run on GPUs with 32 - 40GB for perplexity evaluation of up to LLaMA-65B and Falcon-40B models.

Here's the command I'm running

python main.py falcon_model/models--tiiuae--falcon-40b/snapshots/c47b371b31a68349c233104050ac76680b8485db custom \
  --custom_data_path=data/refined_web_n=128.pth \
  --wbits 4 \
  --groupsize 16 \
  --perchannel \
  --qq_scale_bits 3 \
  --qq_zero_bits 3 \
  --qq_groupsize 16 \
  --outlier_threshold=0.2 \
  --permutation_order act_order \
  --percdamp 1e0 \
  --nsamples 128

Here's the full command output:

/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/computation/expressions.py:20: UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.1' currently installed).
  from pandas.core.computation.check import NUMEXPR_INSTALLED
============  Loading model... ============
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            ubuntu
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4122

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           ubuntu
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:47<00:00,  5.23s/it]

============ Quantizing model... ============
Loading data ...

Starting SPQR quantization ...
catching inputs from data

---------------- Layer 0 of 60 ----------------
layer_dev_original=device(type='cpu')
Quantizing module self_attention.query_key_value of layer 0                                                                                                                                                                                                                              
Quantizing module self_attention.dense of layer 0                                                                                                                                                                                                                                        
Quantizing module mlp.dense_h_to_4h of layer 0                                                                                                                                                                                                                                           
Quantizing module mlp.dense_4h_to_h of layer 0                                                                                                                                                                                                                                           
Traceback (most recent call last):
  File "main.py", line 549, in <module>
    quantize_model(model, args, device)
  File "main.py", line 73, in quantize_model
    results = quantize_spqr(model, dataloader, args, device)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "main.py", line 217, in quantize_spqr
    quantized = spqr_handlers[sublayer_name].quantize(
  File "/home/ubuntu/SpQR/spqr_engine.py", line 84, in quantize
    H = H[perm][:, perm]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 39.56 GiB total capacity; 33.54 GiB already allocated; 2.80 GiB free; 35.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is there something I'm doing wrong when launching the command?

[ptb perplexity is different from paper]

Hi,

Thank you for sharing your work.

The reproduced perplexity on the PTB dataset using your code does not match the paper: the reproduced value is 27.8, while in the paper it is around 9. Could you please clarify?


Process killed after eval phase

I did try running the code from your repository; however, when I attempted to add the --save_safetensors feature, the process was interrupted after performing the evaluation. I didn't encounter any issues regarding permissions or memory problems. Do you think there might be an issue here?

Provide SpQR trained model weights on OpenLLaMA?

Hi

I was wondering if you folks can provide SpQR trained model weights on OpenLLaMA?

OpenLLaMA has an Apache-2.0 license and has reportedly come close to the original LLaMA's performance on benchmarks.

Thanks. (And awesome research btw 😊)

Post Quantization for nllb-models

Hi @Vahe1994,

I have fine-tuned Facebook's NLLB model on my custom dataset for language translation. Could you provide a guideline on how to perform SpQR quantization of this fine-tuned model? Specifically, I am interested in post-training quantization methodologies.

Thanks in advance and great work implementing SpQR

Does permutation order have to be included when saving the quantized model?

I understand model saving is yet to be implemented, but it looks like permutation may increase the memory footprint of the model.

If we save an SpQR-quantized model in a file and try to dequantize it, we'll end up with a permuted version of the weight matrices (in floating point). So, to use it for inference, it would need to be de-permuted.

Is there any other way of doing inference in SpQR without having to save the permutation order?
