
llm-kick's Introduction

[ICLR 2024] Compressing LLMs: The Truth is Rarely Pure and Never Simple


This code is a reproduced (unofficial) version of work done during an internship at Apple.

Authors: Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang

Paper Link: https://arxiv.org/abs/2310.01382


Overview


Note

  1. We introduce the Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks that redefines the evaluation protocol for compressed LLMs. Compressed LLMs can retain significant alignment with their dense counterparts in perplexity, yet perplexity fails to capture subtle changes in their true capabilities.
  2. Some of our key observations: (a) all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity on knowledge-intensive tasks; (b) current quantization methods are more successful than pruning; (c) pruned LLMs, even at ≥ 50% sparsity, remain robust in-context retrieval and summarization systems.
  3. LLM-KICK is designed to holistically assess compressed LLMs' ability in language understanding, reasoning, generation, in-context retrieval, and in-context summarization.

Update

  • (02.06.2024) We released the code for LLM-KICK, with support for Vicuna 7B, 13B, 30B, and 65B.
  • (02.08.2024) We added support for GPTQ-related experiments.

Installation


Step 1: Clone this repository and navigate to the llm_kick folder:

git clone https://github.com/apple/llm_kick
cd llm_kick

Step 2: Create the conda environment:

conda create -n llm_kick python=3.9
conda activate llm_kick

Step 3: Install relevant packages:

conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install transformers==4.28.0 datasets==2.11.0 wandb sentencepiece
pip install accelerate==0.18.0
pip install shortuuid tqdm
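
As a quick sanity check of the environment (our suggestion, not part of the repository), you can verify that the pinned transformers version loads a Vicuna checkpoint into the default cache directory:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloads to the default --cache_dir used by the repo's scripts.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.3", cache_dir="llm_weights")
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3",
    torch_dtype=torch.float16,  # half precision keeps the 7B model within ~14 GB
    cache_dir="llm_weights",
)
print(f"Loaded {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")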

Usage


We provide a quick overview of the important arguments:

  • --model: The identifier of the Vicuna/LLaMA model on the Hugging Face model hub.
  • --cache_dir: Directory for loading or storing LLM weights. Defaults to llm_weights.
  • --prune_method: The pruning method, one of [magnitude, wanda, sparsegpt] (see the sketch after this list).
  • --sparsity_ratio: The fraction of weights to be pruned (e.g., 0.20 for 20% sparsity).
  • --sparsity_type: The sparsity type, structured or unstructured: one of [unstructured, 4:8, 2:4, 1:2].
  • --num_examples: The number of examples used for evaluation.
  • --nsamples: The number of calibration samples to use during pruning with Wanda and SparseGPT.
  • --ntrain: The number of in-context training examples.
  • --include_context: Whether to include the retrieved context passage in the ICR-QA setting.
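
For intuition about --prune_method magnitude and --sparsity_ratio, here is a minimal sketch of unstructured magnitude pruning on a single linear layer. It illustrates the general technique only; the repository's actual implementation may differ:

import torch

def magnitude_prune_(linear: torch.nn.Linear, sparsity_ratio: float) -> None:
    """Zero out the lowest-magnitude weights of a layer in-place (unstructured)."""
    w = linear.weight.data
    k = int(w.numel() * sparsity_ratio)  # number of weights to remove
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values
    w[w.abs() <= threshold] = 0.0

layer = torch.nn.Linear(4096, 4096)
magnitude_prune_(layer, sparsity_ratio=0.20)
print((layer.weight == 0).float().mean())  # ~0.20, i.e., 20% sparsity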

Script example of Factoid QA experiments

CUDA_VISIBLE_DEVICES=0 python main_freebase.py \
    --model_type vicuna7b \
    --prune_method magnitude \
    --sparsity_ratio 0.20 \
    --sparsity_type unstructured \
    --num_examples 10

CUDA_VISIBLE_DEVICES=0 python main_freebase.py \
    --model_type vicuna7b \
    --prune_method wanda \
    --sparsity_ratio 0.20 \
    --sparsity_type 2:4 \
    --num_examples 10 
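
The second command requests 2:4 semi-structured sparsity: within every contiguous group of four weights, the two smallest-magnitude entries are zeroed. A minimal sketch of how such a mask can be built (illustrative only, not the repository's code):

import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m, per row."""
    rows, cols = weight.shape  # assumes cols is divisible by m
    groups = weight.abs().reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest entries in each group are pruned.
    pruned = torch.topk(groups, m - n, dim=-1, largest=False).indices
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(-1, pruned, False)
    return mask.reshape(rows, cols)

w = torch.randn(8, 16)
w = w * nm_mask(w)  # exactly 2 nonzeros in every group of 4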

Script example of MCR-QA experiments

CUDA_VISIBLE_DEVICES=0 python main_mmlu.py \
    --model_type vicuna7b \
    --prune_method wanda \
    --sparsity_ratio 0.20 \
    --sparsity_type unstructured 
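
MCR-QA evaluates MMLU-style multiple-choice questions, and --ntrain sets the number of in-context demonstrations. A sketch of how such a k-shot prompt is typically assembled (the repository's exact template may differ):

CHOICES = ["A", "B", "C", "D"]

def format_example(question, options, answer=None):
    """Render one question; demonstrations additionally include the gold label."""
    text = question + "\n" + "\n".join(f"{c}. {o}" for c, o in zip(CHOICES, options))
    text += "\nAnswer:"
    if answer is not None:
        text += f" {answer}\n\n"
    return text

def build_prompt(dev_examples, test_example, ntrain=5):
    prompt = "The following are multiple choice questions (with answers).\n\n"
    for question, options, answer in dev_examples[:ntrain]:
        prompt += format_example(question, options, answer)
    question, options, _ = test_example
    return prompt + format_example(question, options)  # model completes A/B/C/D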

Script example of ICR-QA experiments

  • Script for Closed-Book QA
CUDA_VISIBLE_DEVICES=0 python main_icrqa.py \
    --model_type vicuna7b \
    --prune_method magnitude \
    --sparsity_ratio 0.20 \
    --sparsity_type unstructured \
    --num_examples 10 
  • Script for Open-Book QA (the sketch after these commands shows how --include_context changes the prompt)
CUDA_VISIBLE_DEVICES=0 python main_icrqa.py \
    --model_type vicuna7b \
    --prune_method magnitude \
    --sparsity_ratio 0.20 \
    --sparsity_type unstructured \
    --include_context \
    --num_examples 10
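
The only difference between the two runs is --include_context: in the open-book setting, the retrieved passage is prepended to the question. A minimal illustration of the distinction (the prompt wording here is our assumption, not the repository's exact template):

from typing import Optional

def make_icr_prompt(question: str, context: Optional[str] = None) -> str:
    if context is not None:  # open-book: --include_context supplies the passage
        return f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    return f"Question: {question}\nAnswer:"  # closed-book: parametric memory only

print(make_icr_prompt("Who wrote Hamlet?"))
print(make_icr_prompt("Who wrote Hamlet?",
                      context="Hamlet is a tragedy written by William Shakespeare."))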

Script example of Text Summarization experiments

  1. Add your OpenAI API key at lines 1 and 2 to obtain the GPT-3.5 reference summary and to initialize the GPT-4 judge, respectively (a key-handling sketch follows this list).

  2. Script for generating the GPT-3.5 Summary

    python gpt35_openai.py --question ./json_utils/new_question.jsonl --output ./gpt35_answer.jsonl
    
  3. Script for generating the Compressed Model Summary

    CUDA_VISIBLE_DEVICES=0 python compressed_model.py \
     --model_id lmsys/vicuna-7b-v1.3 \
     --prune_method magnitude \
     --sparsity_ratio 0.20 \
     --sparsity_type unstructured \
     --question-file ./table/new_question.jsonl \
     --nsamples 128
    
  4. Script for running GPT-4 Judge

    CUDA_VISIBLE_DEVICES=0 python gpt4_judge.py --prune_method magnitude --sparsity_ratio 0.20
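
Rather than hard-coding the key at the top of the scripts, you can read it from an environment variable. The sketch below is our suggestion and assumes the legacy openai<1.0 Python client that scripts of this era typically use:

import os
import openai  # assumes openai<1.0; the 1.x client uses a different interface

openai.api_key = os.environ["OPENAI_API_KEY"]  # export OPENAI_API_KEY=... beforehand

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize the following article: ..."}],
)
print(resp["choices"][0]["message"]["content"])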
    

Script example of Instruction Following experiments

  1. Add your OpenAI API key at lines 1 and 2 to obtain the GPT-3.5 reference responses to the multi-turn conversation questions and to initialize the GPT-4 judge, respectively (a hypothetical question-file record is sketched after this list).

  2. Script for generating the GPT-3.5 Responses.

    python gpt35_openai.py --question ./json_utils/question.jsonl --output ./answer.jsonl
    
  3. Script for generating the Compressed Model Responses.

    CUDA_VISIBLE_DEVICES=0 python free_form_answer_prune.py \
     --model_id lmsys/vicuna-7b-v1.3 \
     --prune_method magnitude \
     --sparsity_ratio 0.20 \
     --sparsity_type unstructured \
     --question-file ./table/question.jsonl \
     --nsamples 128
    
  4. Script for running GPT-4 Judge.

    CUDA_VISIBLE_DEVICES=0 python gpt4_judge.py --prune_method magnitude --sparsity_ratio 0.20
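
The question files are JSONL, one record per line. The schema is not documented here, so the record below is hypothetical: the question_id, category, and turns field names are assumptions in the style of the Vicuna multi-turn evaluation files:

import json

# Hypothetical record; field names are assumptions, not the repo's documented schema.
record = {
    "question_id": 1,
    "category": "writing",
    "turns": [
        "Compose an engaging travel blog post about a recent trip to Hawaii.",
        "Rewrite your previous response as a haiku.",
    ],
}

with open("question.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")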
    

Script example for Quantization Experiments

Script to generate the quantized models:

[1] Vicuna 7B:   python llama.py --wbits 8 --save <SAVE_LOCATION>/vicuna7b-8bit-128g.pt --model lmsys/vicuna-7b-v1.3
[2] Vicuna 13B:  python llama.py --wbits 8 --save <SAVE_LOCATION>/vicuna13b-8bit-128g.pt --model lmsys/vicuna-13b-v1.3

Script to test the quantized model on Factoid-QA:

CUDA_VISIBLE_DEVICES=0 python vicuna_inference_freebase.py --wbits 8 --load <SAVE_LOCATION>/vicuna13b-8bit-128g.pt --model lmsys/vicuna-13b-v1.3
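
The commands above produce 8-bit weights with a group size of 128 (the -128g suffix). For intuition, here is a round-to-nearest sketch of per-group 8-bit quantization; GPTQ itself goes further and corrects quantization error using second-order information:

import torch

def quantize_group(w: torch.Tensor, bits: int = 8):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

w = torch.randn(128)            # one group of 128 weights
q, scale = quantize_group(w)
w_hat = q.float() * scale       # dequantized weights
print((w - w_hat).abs().max())  # small round-off error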

Acknowledgement

This repository is built upon the Wanda, SparseGPT, and GPTQ repositories.

More details coming soon!

Citation

If you find this repository helpful, please cite:

@article{jaiswal2023compressing,
  title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
  author={Jaiswal, Ajay and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zhangyang and Yang, Yinfei},
  journal={arXiv preprint arXiv:2310.01382},
  year={2023}
}


llm-kick's Issues

Discrepancy Between Reported and Obtained Results

Hi,

I tried to reproduce the simplest setting, Vicuna-7B quantized with GPTQ on Factoid-based QA, and here are the results I obtained:

| Benchmark        | Vicuna-7B (16-bit) | Vicuna-7B GPTQ (8-bit) | Vicuna-7B GPTQ (4-bit) |
|------------------|--------------------|------------------------|------------------------|
| Factoid-based QA | 60.54%             | 60.04% (-0.5%)         | 45.95% (-14.59%)       |

These numbers differ substantially from the results below, deduced from the paper:

| Benchmark        | Vicuna-7B (16-bit) | Vicuna-7B GPTQ (8-bit) | Vicuna-7B GPTQ (4-bit) |
|------------------|--------------------|------------------------|------------------------|
| Factoid-based QA | ?                  | ? (~ -7%)              | ? (~ -36%)             |

I have attached the test outputs for the three runs.

freebase_16bits_128_group.txt

freebase_8bits_128_group.txt

freebase_4bits_128_group.txt

Could you please provide the exact process to reproduce your results?
