
llm-kick's Introduction

[ICLR 2024] Compressing LLMs: The Truth is Rarely Pure and Never Simple


This code is a reproduced (unofficial) version of work done during an internship at Apple.

Authors: Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang

Paper Link: https://arxiv.org/abs/2310.01382


Overview


Note

  1. We introduce the Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks that redefines the evaluation protocol for compressed LLMs. Compressed LLMs can retain significant alignment with their dense counterparts in perplexity, yet perplexity fails to capture subtle changes in their true capabilities.
  2. Some of our key observations: (a) all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity on knowledge-intensive tasks; (b) current quantization methods are more successful than pruning; (c) pruned LLMs, even at ≥ 50% sparsity, remain robust in-context retrieval and summarization systems.
  3. LLM-KICK is designed to holistically assess compressed LLMs' ability in language understanding, reasoning, generation, in-context retrieval, and in-context summarization.

Update

  • (02.06.2024) We released the code for LLM-KICK, with support for Vicuna 7B, 13B, 30B, and 65B.
  • (02.08.2024) We added support for GPTQ-related experiments.

Installation


Step 1: Clone this repository and navigate to the llm_kick folder:

git clone https://github.com/apple/llm_kick
cd llm_kick

Step 2: Create the conda environment:

conda create -n llm_kick python=3.9
conda activate llm_kick

Step 3: Install relevant packages:

conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install transformers==4.28.0 datasets==2.11.0 wandb sentencepiece
pip install accelerate==0.18.0
pip install shortuuid tqdm
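
As a quick sanity check of the environment (our suggestion, not part of the repository), you can verify that the pinned transformers version loads a Vicuna checkpoint into the default cache directory:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloads to the default --cache_dir used by the repo's scripts.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.3", cache_dir="llm_weights")
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3",
    torch_dtype=torch.float16,  # half precision keeps the 7B model within ~14 GB
    cache_dir="llm_weights",
)
print(f"Loaded {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")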

Usage


We provide a quick overview of the important arguments:

  • --model: The identifier of the Vicuna/LLaMA model on the Hugging Face model hub.
  • --cache_dir: Directory for loading or storing LLM weights. Defaults to llm_weights.
  • --prune_method: The pruning method, one of [magnitude, wanda, sparsegpt] (see the sketch after this list).
  • --sparsity_ratio: The fraction of weights to be pruned (e.g., 0.20 for 20% sparsity).
  • --sparsity_type: The sparsity type, structured or unstructured: one of [unstructured, 4:8, 2:4, 1:2].
  • --num_examples: The number of examples used for evaluation.
  • --nsamples: The number of calibration samples to use during pruning with Wanda and SparseGPT.
  • --ntrain: The number of in-context training examples.
  • --include_context: Whether to include the retrieved context passage in the ICR-QA setting.
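
For intuition about --prune_method magnitude and --sparsity_ratio, here is a minimal sketch of unstructured magnitude pruning on a single linear layer. It illustrates the general technique only; the repository's actual implementation may differ:

import torch

def magnitude_prune_(linear: torch.nn.Linear, sparsity_ratio: float) -> None:
    """Zero out the lowest-magnitude weights of a layer in-place (unstructured)."""
    w = linear.weight.data
    k = int(w.numel() * sparsity_ratio)  # number of weights to remove
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values
    w[w.abs() <= threshold] = 0.0

layer = torch.nn.Linear(4096, 4096)
magnitude_prune_(layer, sparsity_ratio=0.20)
print((layer.weight == 0).float().mean())  # ~0.20, i.e., 20% sparsity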

Script example of Factoid QA experiments

CUDA_VISIBLE_DEVICES=0 python main_freebase.py \
    --model_type vicuna7b \
    --prune_method magnitude \
    --sparsity_ratio 0.20 \
    --sparsity_type unstructured \
    --num_examples 10

CUDA_VISIBLE_DEVICES=0 python main_freebase.py \
    --model_type vicuna7b \
    --prune_method wanda \
    --sparsity_ratio 0.20 \
    --sparsity_type 2:4 \
    --num_examples 10 
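
The second command requests 2:4 semi-structured sparsity: within every contiguous group of four weights, the two smallest-magnitude entries are zeroed. A minimal sketch of how such a mask can be built (illustrative only, not the repository's code):

import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m, per row."""
    rows, cols = weight.shape  # assumes cols is divisible by m
    groups = weight.abs().reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest entries in each group are pruned.
    pruned = torch.topk(groups, m - n, dim=-1, largest=False).indices
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(-1, pruned, False)
    return mask.reshape(rows, cols)

w = torch.randn(8, 16)
w = w * nm_mask(w)  # exactly 2 nonzeros in every group of 4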

Script example of MCR-QA experiments

CUDA_VISIBLE_DEVICES=0 python main_mmlu.py \
    --model_type vicuna7b \
    --prune_method wanda \
    --sparsity_ratio 0.20 \
    --sparsity_type unstructured 
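
MCR-QA evaluates MMLU-style multiple-choice questions, and --ntrain sets the number of in-context demonstrations. A sketch of how such a k-shot prompt is typically assembled (the repository's exact template may differ):

CHOICES = ["A", "B", "C", "D"]

def format_example(question, options, answer=None):
    """Render one question; demonstrations additionally include the gold label."""
    text = question + "\n" + "\n".join(f"{c}. {o}" for c, o in zip(CHOICES, options))
    text += "\nAnswer:"
    if answer is not None:
        text += f" {answer}\n\n"
    return text

def build_prompt(dev_examples, test_example, ntrain=5):
    prompt = "The following are multiple choice questions (with answers).\n\n"
    for question, options, answer in dev_examples[:ntrain]:
        prompt += format_example(question, options, answer)
    question, options, _ = test_example
    return prompt + format_example(question, options)  # model completes A/B/C/D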

Script example of ICR-QA experiments

  • Script for Closed-Book QA
CUDA_VISIBLE_DEVICES=0 python main_icrqa.py \
    --model_type vicuna7b \
    --prune_method magnitude \
    --sparsity_ratio 0.20 \
    --sparsity_type unstructured \
    --num_examples 10 
  • Script for Open-Book QA (the sketch after these commands shows how --include_context changes the prompt)
CUDA_VISIBLE_DEVICES=0 python main_icrqa.py \
    --model_type vicuna7b \
    --prune_method magnitude \
    --sparsity_ratio 0.20 \
    --sparsity_type unstructured \
    --include_context \
    --num_examples 10
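
The only difference between the two runs is --include_context: in the open-book setting, the retrieved passage is prepended to the question. A minimal illustration of the distinction (the prompt wording here is our assumption, not the repository's exact template):

from typing import Optional

def make_icr_prompt(question: str, context: Optional[str] = None) -> str:
    if context is not None:  # open-book: --include_context supplies the passage
        return f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    return f"Question: {question}\nAnswer:"  # closed-book: parametric memory only

print(make_icr_prompt("Who wrote Hamlet?"))
print(make_icr_prompt("Who wrote Hamlet?",
                      context="Hamlet is a tragedy written by William Shakespeare."))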

Script example of Text Summarization experiments

  1. Add your OpenAI API key at lines 1 and 2 to obtain the GPT-3.5 reference summary and to initialize the GPT-4 judge, respectively (a key-handling sketch follows this list).

  2. Script for generating the GPT-3.5 Summary

    python gpt35_openai.py --question ./json_utils/new_question.jsonl --output ./gpt35_answer.jsonl
    
  3. Script for generating the Compressed Model Summary

    CUDA_VISIBLE_DEVICES=0 python compressed_model.py \
     --model_id lmsys/vicuna-7b-v1.3 \
     --prune_method magnitude \
     --sparsity_ratio 0.20 \
     --sparsity_type unstructured \
     --question-file ./table/new_question.jsonl \
     --nsamples 128
    
  4. Script for running GPT-4 Judge

    CUDA_VISIBLE_DEVICES=0 python gpt4_judge.py --prune_method magnitude --sparsity_ratio 0.20
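
Rather than hard-coding the key at the top of the scripts, you can read it from an environment variable. The sketch below is our suggestion and assumes the legacy openai<1.0 Python client that scripts of this era typically use:

import os
import openai  # assumes openai<1.0; the 1.x client uses a different interface

openai.api_key = os.environ["OPENAI_API_KEY"]  # export OPENAI_API_KEY=... beforehand

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize the following article: ..."}],
)
print(resp["choices"][0]["message"]["content"])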
    

Script example of Instruction Following experiments

  1. Add your OpenAI API key at lines 1 and 2 to obtain the GPT-3.5 reference responses to the multi-turn conversation questions and to initialize the GPT-4 judge, respectively (a hypothetical question-file record is sketched after this list).

  2. Script for generating the GPT-3.5 Responses.

    python gpt35_openai.py --question ./json_utils/question.jsonl --output ./answer.jsonl
    
  3. Script for generating the Compressed Model Responses.

    CUDA_VISIBLE_DEVICES=0 python free_form_answer_prune.py \
     --model_id lmsys/vicuna-7b-v1.3 \
     --prune_method magnitude \
     --sparsity_ratio 0.20 \
     --sparsity_type unstructured \
     --question-file ./table/question.jsonl \
     --nsamples 128
    
  4. Script for running GPT-4 Judge.

    CUDA_VISIBLE_DEVICES=0 python gpt4_judge.py --prune_method magnitude --sparsity_ratio 0.20
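
The question files are JSONL, one record per line. The schema is not documented here, so the record below is hypothetical: the question_id, category, and turns field names are assumptions in the style of the Vicuna multi-turn evaluation files:

import json

# Hypothetical record; field names are assumptions, not the repo's documented schema.
record = {
    "question_id": 1,
    "category": "writing",
    "turns": [
        "Compose an engaging travel blog post about a recent trip to Hawaii.",
        "Rewrite your previous response as a haiku.",
    ],
}

with open("question.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")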
    

Script example for Quantization Experiments

Script to generate the quantized models:

[1] Vicuna 7B:   python llama.py --wbits 8 --save <SAVE_LOCATION>/vicuna7b-8bit-128g.pt --model lmsys/vicuna-7b-v1.3
[2] Vicuna 13B:  python llama.py --wbits 8 --save <SAVE_LOCATION>/vicuna13b-8bit-128g.pt --model lmsys/vicuna-13b-v1.3

Script to test the quantized model on Factoid-QA:

CUDA_VISIBLE_DEVICES=0 python vicuna_inference_freebase.py --wbits 8 --load <SAVE_LOCATION>/vicuna13b-8bit-128g.pt --model lmsys/vicuna-13b-v1.3
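
The commands above produce 8-bit weights with a group size of 128 (the -128g suffix). For intuition, here is a round-to-nearest sketch of per-group 8-bit quantization; GPTQ itself goes further and corrects quantization error using second-order information:

import torch

def quantize_group(w: torch.Tensor, bits: int = 8):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

w = torch.randn(128)            # one group of 128 weights
q, scale = quantize_group(w)
w_hat = q.float() * scale       # dequantized weights
print((w - w_hat).abs().max())  # small round-off error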

Acknowledgement

This repository is built upon the Wanda, SparseGPT, and GPTQ repositories.

More details coming soon!

Citation

If you find this repository helpful, please cite:

@article{jaiswal2023compressing,
  title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
  author={Jaiswal, Ajay and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zhangyang and Yang, Yinfei},
  journal={arXiv preprint arXiv:2310.01382},
  year={2023}
}


llm-kick's Issues

Discrepancy Between Reported and Obtained Results

Hi,

I tried to reproduce the simplest setting, Vicuna-7B quantized with GPTQ on Factoid-based QA, and here are the results I obtained:

| Benchmark        | Vicuna-7B (16-bit) | Vicuna-7B GPTQ (8-bit) | Vicuna-7B GPTQ (4-bit) |
|------------------|--------------------|------------------------|------------------------|
| Factoid-based QA | 60.54%             | 60.04% (-0.5%)         | 45.95% (-14.59%)       |

These numbers differ substantially from the results below, deduced from the paper:

| Benchmark        | Vicuna-7B (16-bit) | Vicuna-7B GPTQ (8-bit) | Vicuna-7B GPTQ (4-bit) |
|------------------|--------------------|------------------------|------------------------|
| Factoid-based QA | ?                  | ? (~ -7%)              | ? (~ -36%)             |

I have attached the test outputs for the three runs.

freebase_16bits_128_group.txt

freebase_8bits_128_group.txt

freebase_4bits_128_group.txt

Could you please provide the exact process to reproduce your results?
