
wanda's Introduction

Pruning LLMs by Weights and Activations

Official PyTorch implementation of Wanda (Pruning by Weights and activations), as presented in our paper:

A Simple and Effective Pruning Approach for Large Language Models
Mingjie Sun*, Zhuang Liu*, Anna Bair, J. Zico Kolter (* indicates equal contribution)
Carnegie Mellon University, Meta AI Research and Bosch Center for AI
Paper - Project page

@article{sun2023wanda,
  title={A Simple and Effective Pruning Approach for Large Language Models}, 
  author={Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, J. Zico},
  year={2023},
  journal={arXiv preprint arXiv:2306.11695}
}

Compared to magnitude pruning, which removes weights solely based on their magnitudes, our pruning approach Wanda removes weights on a per-output basis, scoring each weight by the product of its magnitude and the norm of the corresponding input activation.
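
For intuition, below is a minimal, self-contained sketch of this pruning rule for a single linear layer. It is illustrative only, not the repository's implementation; W is a weight matrix of shape (C_out, C_in) and X holds calibration activations of shape (N, C_in).

import torch

def wanda_prune_layer(W: torch.Tensor, X: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    # Wanda score: |weight| times the l2 norm of the corresponding input feature,
    # with the per-feature norms broadcast across output rows.
    metric = W.abs() * X.norm(p=2, dim=0)
    # Per-output comparison group: within each row, zero out the lowest-scoring fraction.
    n_prune = int(W.shape[1] * sparsity)
    _, idx = torch.sort(metric, dim=1)            # ascending, smallest scores first
    mask = torch.zeros_like(W, dtype=torch.bool)
    mask.scatter_(1, idx[:, :n_prune], True)      # True marks weights to remove
    return W.masked_fill(mask, 0.0)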

Update

  • (9.22.2023) Add support for LLaMA-2.
  • (9.22.2023) Add code to reproduce the ablation study on OBS weight update in the paper.
  • (10.6.2023) Add new support for the weight update analysis in the ablation study. Feel free to try it out!
  • (10.6.2023) Add support for zero-shot evaluation.
  • (10.20.2023) Add code for pruning OPT models.
  • (10.23.2023) Add code for LoRA fine-tuning.

Setup

Installation instructions can be found in INSTALL.md.

Usage

The scripts directory contains all the bash commands to replicate the main results (Table 2) in our paper.

Below is an example command for pruning LLaMA-7B with Wanda, to achieve unstructured 50% sparsity.

python main.py \
    --model decapoda-research/llama-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save out/llama_7b/unstructured/wanda/ 

We provide a quick overview of the arguments:

  • --model: The identifier for the LLaMA model on the Hugging Face model hub.
  • --cache_dir: Directory for loading or storing LLM weights. The default is llm_weights.
  • --prune_method: We have implemented three pruning methods, namely [magnitude, wanda, sparsegpt].
  • --sparsity_ratio: Denotes the fraction of weights to be pruned (e.g., 0.5 for 50% sparsity).
  • --sparsity_type: Specifies the type of sparsity [unstructured, 2:4, 4:8].
  • --use_variant: Whether to use the Wanda variant, default is False.
  • --save: Specifies the directory where the result will be stored.

For structured N:M sparsity, set the argument --sparsity_type to "2:4" or "4:8". An illustrative command is provided below:

python main.py \
    --model decapoda-research/llama-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type 2:4 \
    --save out/llama_7b/2-4/wanda/ 
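
For N:M sparsity, the comparison group shrinks from a whole output row to every M consecutive input channels: at most N weights are kept in each group of M. Below is a hedged sketch of the mask construction, reusing the illustrative metric from the sketch above (not the repository's implementation).

import torch

def nm_prune_mask(metric: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    # metric has shape (C_out, C_in); returns True where a weight should be pruned.
    C_out, C_in = metric.shape
    assert C_in % m == 0, "input dimension must be divisible by m"
    groups = metric.reshape(C_out, C_in // m, m)
    # Prune the (m - n) lowest-scoring weights inside every group of m channels.
    _, idx = torch.topk(groups, m - n, dim=-1, largest=False)
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask.reshape(C_out, C_in)

# Example usage: W.masked_fill_(nm_prune_mask(W.abs() * X.norm(p=2, dim=0), n=2, m=4), 0.0)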

Pruning LLaMA-2

For LLaMA-2 models, replace --model with meta-llama/Llama-2-7b-hf (take 7b as an example):

python main.py \
    --model meta-llama/Llama-2-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save out/llama2_7b/unstructured/wanda/

LLaMA-2 results (WikiText perplexity; LLaMA-2-34B has not been released as of 9.22.2023):

sparsity           method      llama2-7b   llama2-13b   llama2-70b
-                  dense       5.12        4.57         3.12
unstructured 50%   magnitude   14.89       6.37         4.98
unstructured 50%   sparsegpt   6.51        5.63         3.98
unstructured 50%   wanda       6.42        5.56         3.98
4:8                magnitude   16.48       6.76         5.58
4:8                sparsegpt   8.12        6.60         4.59
4:8                wanda       7.97        6.55         4.47
2:4                magnitude   54.59       8.33         6.33
2:4                sparsegpt   10.17       8.32         5.40
2:4                wanda       11.02       8.27         5.16

Ablation on OBS weight update

To reproduce the analysis on weight update, we provide our implementation for this ablation. All commands can be found in this script.

for method in ablate_mag_seq ablate_wanda_seq ablate_mag_iter ablate_wanda_iter 
do 
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model decapoda-research/llama-7b-hf \
  --sparsity_ratio 0.5 \
  --sparsity_type unstructured \
  --prune_method ${method} \
  --save out/llama_7b_ablation/unstructured/
done 

Here ablate_{mag/wanda}_{seq/iter} means that we use magnitude pruning or wanda to obtain the pruned mask at each layer, and then apply the weight update procedure in either a sequential or an iterative manner, every 128 input channels. For details, please see Section 5 of our paper.
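
As a rough structural sketch of one plausible reading of the two styles (this is not the repository's implementation, and the OBS-style least-squares update from SparseGPT is deliberately omitted; see Section 5 of the paper and the repository code for the real procedure):

import torch

BLOCK = 128  # the weight update is applied every 128 input channels

def lowest_scores_mask(scores, sparsity):
    # True marks the lowest-scoring fraction of weights in each output row.
    k = int(scores.shape[1] * sparsity)
    _, idx = torch.sort(scores, dim=1)
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(1, idx[:, :k], True)
    return mask

def prune_and_update(W, act_norm, sparsity, style="seq"):
    # W: (C_out, C_in) weight; act_norm: (C_in,) per-channel activation norms.
    # "seq": decide the whole layer's mask up front, then sweep blocks left to right.
    # "iter": alternate, re-scoring each block from the already-updated weights.
    C_out, C_in = W.shape
    if style == "seq":
        full_mask = lowest_scores_mask(W.abs() * act_norm, sparsity)
    for start in range(0, C_in, BLOCK):
        end = min(start + BLOCK, C_in)
        if style == "iter":
            block_mask = lowest_scores_mask(W[:, start:end].abs() * act_norm[start:end], sparsity)
        else:
            block_mask = full_mask[:, start:end]
        W[:, start:end] = W[:, start:end].masked_fill(block_mask, 0.0)
        # ... here the remaining weights would be adjusted with the OBS-style
        # update to compensate for the pruned block (not shown) ...
    return W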

Zero-Shot Evaluation

For evaluating zero-shot tasks, we modify the EleutherAI LM Harness framework so that it can evaluate pruned LLMs. We provide the modified repo in this link. Make sure to download, extract, and install this custom lm_eval package from the source code.

For reproducibility, we used commit df3da98 on the main branch. All tasks were evaluated with task version 0, except for BoolQ, which uses task version 1.

At a high level, the functionality we provide adds two arguments, pretrained_model and tokenizer, to this function. We can then call this simple_evaluate function API from our codebase to evaluate sparse pruned LLMs. To evaluate zero-shot tasks in addition to the WikiText perplexity, pass the --eval_zero_shot argument.
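
Below is a hedged sketch of what such a call might look like from our codebase, assuming the fork's simple_evaluate accepts the two added arguments as described; the adapter name string, the task list, and the exact signature should be checked against the linked repo.

from lm_eval import evaluator  # the modified EleutherAI lm-evaluation-harness

# `model` and `tokenizer` are the already-pruned HF model and its tokenizer.
results = evaluator.simple_evaluate(
    model="hf-causal-experimental",   # assumed HF adapter name in the fork
    pretrained_model=model,           # added argument: pass the in-memory pruned model
    tokenizer=tokenizer,              # added argument: reuse the existing tokenizer
    tasks=["boolq", "rte", "hellaswag", "winogrande", "arc_easy", "arc_challenge", "openbookqa"],
    num_fewshot=0,
)
print(results["results"])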

Speedup Evaluation

The pruning speed for each method is measured as the cumulative time spent on pruning each layer, excluding the forward passes.

For inference speedup with structured sparsity, we refer the reader to this blog post, where structured sparsity is supported by PyTorch >= 2.1. You can switch between the CUTLASS and cuSPARSELt kernels here.
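
Below is a hedged sketch of accelerating a 2:4-pruned linear layer with PyTorch's semi-structured sparsity support. It assumes PyTorch >= 2.1, fp16 weights, and an Ampere-or-newer GPU; consult the blog post for the authoritative usage.

import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor

# Pick the sparse kernel backend: True selects CUTLASS, False selects cuSPARSELt.
SparseSemiStructuredTensor._FORCE_CUTLASS = True

linear = torch.nn.Linear(4096, 4096, bias=False).half().cuda()

# Impose a 2:4 pattern (keep the 2 largest-magnitude weights out of every 4).
W = linear.weight.detach()
groups = W.reshape(-1, 4)
drop = torch.topk(groups.abs(), 2, dim=-1, largest=False).indices
W_24 = groups.scatter(-1, drop, 0.0).reshape_as(W)

# Convert to the compressed semi-structured representation.
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(W_24))

x = torch.rand(128, 4096, dtype=torch.float16, device="cuda")
with torch.inference_mode():
    y = linear(x)  # the matmul dispatches to the 2:4 sparse kernel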

Lastly, for pruning image classifiers, see the image_classifiers directory for details.

Acknowledgement

This repository is built upon the SparseGPT repository.

License

This project is released under the MIT license. Please see the LICENSE file for more information.

Questions

Feel free to discuss papers/code with us through issues/emails!

mingjies at cs.cmu.edu
liuzhuangthu at gmail.com


wanda's Issues

Question about the latency speedup!

Hi,

Thanks for the great work!
I am curious whether you will provide a script to measure the end-to-end inference latency on a single GPU for the LLaMA-family models.

Thanks,
Yang

run 70b error:RuntimeError: shape '[1, 4096, 64, 128]' is invalid for input of size 4194304

After setting up the environment as instructed, I successfully pruned llama2-7b using Wanda without any issues. However, when attempting to prune llama2-70b, the following error occurred:
Traceback (most recent call last):
File "/home/jovyan/projects/BYD/wanda-main/main.py", line 110, in
main()
File "/home/jovyan/projects/BYD/wanda-main/main.py", line 69, in main
prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
File "/home/jovyan/projects/BYD/wanda-main/lib/prune.py", line 160, in prune_wanda
outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
File "/opt/conda/envs/prune_llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
[...]
File "/opt/conda/envs/prune_llm/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 197, in forward
key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
run command:
python main.py \
    --model ../weights/llama-2-70b-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save ../weights/wanda/ \
    --save_model ../weights/wanda_70b/

Could you please help me understand why this error occurred? Do I need to upgrade the environment, especially the transformers library? Your assistance is appreciated.

Some questions about the codes.

Thanks for your simple but efficient work for pruning.
I am reading your code, but I have some questions.

  • Q1: Does the class WrappedGPT in lib/layerwrapper.py serve a function similar to LayerNorm?
  • Q2: I don't understand the following implementation in lib/prune.py, could you give some explanations?
    150        def add_batch(name):
    151            def tmp(_, inp, out):
    152                wrapped_layers[name].add_batch(inp[0].data, out.data)
    153            return tmp
    154
    155        handles = []
    156        for name in wrapped_layers:
    157            handles.append(subset[name].register_forward_hook(add_batch(name)))
    158        for j in range(args.nsamples):
    159            with torch.no_grad():
    160                outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
    161        for h in handles:
    162            h.remove()
    
    ...
    
    204        for j in range(args.nsamples):
    205            with torch.no_grad():
    206                outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
    207        inps, outs = outs, inps
    
    Are these computations equivalent to the original forward-pass computations in transformers?

OPT-66B, unstructured sparsity gets wikitext perplexity 3404.0751953125

Hello, I used the scripts to prune the OPT-66B. (Unstructured, n_samples 128)
With this, I get a WikiText perplexity of 3404, which is far off the metric reported in the paper.

I was wondering if the code's output metric should be scaled by 0.01 (thus 3.404 perplexity),
or if this is an outlier result.

Some questions.

Hi, please answer a few questions.

Can I prune a model combined with LORA?

Can I prune the model and then use LORA?

Is it possible to prune Falcon 7b and 40b models?

Can you publish ready-made pruned models?

Is it possible to compress models after pruning?

Issue with Mixed Device Tensors (cuda:0 and cuda:1)

Hello,

I've encountered a runtime error while working with your project, and I believe it might be related to the accelerate library. The error message is as follows:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

To provide some context, I have followed the installation instructions in your install.md file and have installed the same versions of accelerate, transformers, and your custom lm-eval module as specified. My environment setup matches the recommended configurations.

I suspect this issue might stem from a potential incompatibility or a configuration problem within the accelerate library, as it's responsible for managing the device placement of tensors.

Do you have any insights or potential solutions for this problem? It would be greatly appreciated if you could provide guidance on how to ensure all tensors are allocated on the same device, or if there's a specific configuration step that I might have missed.

Can wanda be applied to ConvNets for CV tasks?

Hello. Thanks for sharing your great work.

Appendix A states "we only prune the linear layers for ConvNeXt." I understand that the motivation of wanda builds upon the outliers observed in LLMs, which have no Conv layers. However, for ConvNets (e.g., ResNet) in CV tasks, the effect of outliers might not be as significant as in LLMs. I wonder whether wanda can achieve good performance for ConvNets without retraining.

Thank you.

llama_7b wikitext perplexity 7.0915350914

bash scripts/llama_7b.sh

The source model's wikitext perplexity is 5.67702.

After pruning this model at 50% sparsity, the wikitext perplexity is 7.09153509,

but the paper reports 7.26 at 50%.

Why?

Structured 2:4 sparsity pattern supporting GPU

Hello. Thanks for sharing your great works.

I want to see how much faster my model is when pruned with a structured 2:4 sparsity pattern rather than regular pruning, on an A6000 GPU.
In the paper, looking at Table 2, I can compare the WikiText validation perplexity of the 2:4 and 4:8 sparsity patterns against the regular 50% sparsified model.
As far as I know, to use the 2:4 sparsity pattern, I need a GPU with sparse tensor cores (such as the A100).
I was wondering which GPU you used for this performance evaluation, and whether it is possible with an A6000 GPU.
If I'm misunderstanding something, I'd appreciate it if you could point it out.

Thank you.

Compressing a Finetuned llama2 model with lora

Thank you for this amazing work. I was wondering if it was possible to run wanda on a llama2 model fine-tuned with lora? When I gave it a try, I got the following error:

AttributeError: 'LlamaForCausalLM' object has no attribute 'layers'

can not reproduce the results of figure 3

Dear author:
When I try to reproduce the results of Figure 3 on the effect of calibration set size, I get 12.57 for LLaMA-7B if I set the sample size to 1. I only changed the following argument to make the default value 1; nothing else was changed. Could you help check this?
parser.add_argument('--nsamples', type=int, default=1, help='Number of calibration samples.')

Where is cache['attention_mask']?

When using the sparsegpt pruning method, the following code is used, but kwargs['attention_mask'] appears to be None, resulting in a subsequent error.
class Catcher(nn.Module):
    def __init__(self, module):
        super().__init__()
        self.module = module
    def forward(self, inp, **kwargs):
        inps[cache['i']] = inp
        cache['i'] += 1
        cache['attention_mask'] = kwargs['attention_mask']
        cache['position_ids'] = kwargs['position_ids']
        raise ValueError

question about the code in the sparse_trainer.py

I'm running the code for sparse_trainer.py, it raises an error:

AttributeError: 'SparseTrainer' object has no attribute 'do_grad_scaling'.

I checked the code, and 'do_grad_scaling' is only defined in the trainer.py in the dense_ft folder; strangely, however, you import the trainer from transformers.trainer, which does not contain 'do_grad_scaling'.

I wonder which trainer class is the right one. Could you open-source the dense fine-tuning code for sparse_trainer.py?

Questions about sub-networks of LLMs

Hello~. I am reading your paper and notice that you have mentioned many times that "exact and effective sparse sub-networks exist for LLMs". But I am a little confused: your pruning method behaves differently depending on the input activations, which suggests it is a dynamic process. Could you please help me understand this? I would appreciate your help!

Some question about the code

Hi! Thanks for your great work!

I'm a little confused about the implementation. Your simple and efficient method only requires a single forward pass to obtain the activations of each layer. This line seems to mean that the forward pass is executed on the sparse network, which also means the input to the next layer is computed by the current sparse layer, because the current layer is already masked.

I'm wondering whether this "masked forward" is necessary; I notice that if this operation is removed, the result can be better in some conditions.

Fine-tune the pruned model

I noticed that in the recent version, the Full Parameter Fine-tuning was added. How to fine-tune the Wanda pruned model?

Perplexity is off for Llama 2-7b

Hello,
I hope this finds you well.

I was trying to prune Llama 2-7b with wanda (cloned directly from your codebase), so I ran the following command:
python main.py --model meta-llama/Llama-2-7b-hf --prune_method wanda --sparsity_ratio 0.5 --sparsity_type unstructured --save out/llama2_7b/unstructured/wanda/

but I get a perplexity of 10.27 which is way higher than what you guys are reporting. It is being pruned with c4 and tested on wikitext2 (I changed nothing in the codebase). Do you guys maybe have a guess on what I might be doing wrong?

TIA

Need Clarification regarding prune_deit() in - https://github.com/locuslab/wanda/blob/main/image_classifiers/main.py

Hi Team,

in the image classifier code, ( https://github.com/locuslab/wanda/blob/main/image_classifiers/main.py)

in line number 324 to 332,

tick = time.time()
if args.sparsity != 0:
    with torch.no_grad():
        if "convnext" in args.model:
            prune_convnext(args, model, calib_data, device)
        elif "vit" in args.model:
            prune_vit(args, model, calib_data, device)
        elif "deit" in args.model:
            prune_vit(args, model, calib_data, device)

In both the conditions at lines 329 and 331, you call prune_vit(). Why is prune_deit() never called, even though it is imported in the line from prune_utils import prune_convnext, prune_deit, prune_vit, check_sparsity?

regards

Pruned model is same size as original

Great work on the project, really excited to see the outcomes.

However, after running the script below, the pruned model (output) seems to be the same size as the original one (6.38 GB).

!python /content/wanda/main.py \
    --model openlm-research/open_llama_3b_v2 \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save_model out/pruned \
    --save out/open_llama_3b_v2/unstructured/wanda/

Is this correct, or am I missing something?!

gpu memory size recommended for pruning the llama2-7b-chat-hf model

Great work team!

Currently, I am pruning on the llama2-7b-chat-hf model from hugging face.

python main.py \
    --model NousResearch/Llama-2-7b-chat-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type 2:4 \
    --save out/llama_7b-chat-hf/structured/wanda/

got this error message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 11.69 MiB is free. Including non-PyTorch memory, this process has 21.98 GiB memory in use. Of the allocated memory 20.84 GiB is allocated by PyTorch, and 61.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

My GPU specs are below
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA L4 On | 00000000:00:03.0 Off | 0 |
| N/A 52C P8 17W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Cannot load the c4 dataset

Hello,
I tried many things to be able to load the c4 dataset but I keep getting new errors. I already ran pip install -U datasets and pip install -U transformers. It didn't work. I wrote all the other things I tried step-by-step.

I get the following message:

ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', 'en.noclean', 'realnewslike', 'multilingual', 'af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en-multi', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'und', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Latn', 'zu']

I changed the code for the c4 data to the following:

traindata = load_dataset('allenai/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', 'en', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

Then, I started getting the following error:

File "/simla/wanda/lib/data.py", line 48, in get_c4
traindata = load_dataset('allenai/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
File "/home/.local/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
builder_instance.download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1118, in _download_and_prepare
verify_splits(self.info.splits, split_dict)
File "/home/.local/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 92, in verify_splits
raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}

I tried downloading with:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"

After downloading the whole dataset, I need to change the load_dataset function to call the local files. So I did the following:

traindata = load_dataset('/simla/wanda/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train', trust_remote_code=True)
 valdata = load_dataset('/simla/wanda/c4', 'en', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation', trust_remote_code=True)

Now I am getting the following error:

Failed to read file '/simla/wanda/c4/en/c4-train.00000-of-01024.json.gz' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
Generating train split: 0%| | 0/364868892 [00:00<?, ? examples/s]
Traceback (most recent call last):
File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 144, in _generate_tables
dataset = json.load(f)
File "/usr/lib/python3.10/json/init.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.10/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
for _, table in generator:
File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 147, in _generate_tables
raise e
File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 121, in _generate_tables
pa_table = paj.read_json(
File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/simla/wanda/main.py", line 110, in
main()
File "/simla/wanda/main.py", line 69, in main
prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
File "/simla/wanda/lib/prune.py", line 132, in prune_wanda
dataloader, _ = get_loaders("c4",nsamples=args.nsamples,seed=args.seed,seqlen=model.seqlen,tokenizer=tokenizer)
File "/simla/wanda/lib/data.py", line 80, in get_loaders
return get_c4(nsamples, seed, seqlen, tokenizer)
File "/simla/wanda/lib/data.py", line 50, in get_c4
traindata = load_dataset('/simla/wanda/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train', trust_remote_code=True)
File "/home/.local/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
builder_instance.download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

Publish the Llama2 sparsified models

Hi,

I was wondering if you plan to put in a public domain the sparsified Llama2 models. In particular I am interested in the Llama2-70B with 50% unstructured sparsity.

Thanks!

why choosing vicuna as the tokenizer

The tokenizer is defined as follows in the finetune_lm.py in the lora_ft folder:

# tokenizer = AutoTokenizer.from_pretrained(model_args.config_name, use_fast=False)
## we use the tokenizer from vicuna
if "decapoda-research" in model_args.config_name:
    tokenizer = AutoTokenizer.from_pretrained(
        "lmsys/vicuna-13b-delta-v0",
        cache_dir=model_args.cache_dir,
        padding_side="right",
        use_fast=True,
    )

Could you please explain why the tokenizer from Vicuna was chosen?

error in loading datasets

I was trying to load the wiki dataset, but I got this error:

traindata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
File "/home/aelkordy/.conda/envs/prune_llm/lib/python3.9/site-packages/datasets/load.py", line 1804, in load_dataset
ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
File "/home/aelkordy/.conda/envs/prune_llm/lib/python3.9/site-packages/datasets/builder.py", line 1108, in as_dataset
raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).name} is not supported.")
NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

I got similar error for C4.

tokenizers `use_fast = False`

Why is this set to False? It prevents the use of GPT-NeoX variants.

I set it to True and get this error:

python main.py \
    --model databricks/dolly-v2-3b \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save out/llama_7b/unstructured/wanda/
torch 2.0.1
transformers 4.30.2
accelerate 0.20.3
# of gpus:  1
loading llm model databricks/dolly-v2-3b
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Downloading (…)/main/tokenizer.json: 2.11MB [00:00, 7.60MB/s]
Downloading (…)cial_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 228/228 [00:00<00:00, 665kB/s]
use device  cuda:0
pruning starts
loading calibdation data
Downloading readme: 2.38kB [00:00, 4.36MB/s]
Downloading and preparing dataset json/allenai--c4 to /Users/eggie5/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 319M/319M [00:13<00:00, 23.4MB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.20s/it]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.27s/it]
Dataset json downloaded and prepared to /Users/eggie5/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.
Downloading and preparing dataset json/allenai--c4 to /Users/eggie5/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40.5M/40.5M [00:01<00:00, 23.6MB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.20s/it]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.33it/s]
Dataset json downloaded and prepared to /Users/eggie5/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.
Traceback (most recent call last):
  File "/Users/eggie5/Development/wanda/main.py", line 88, in <module>
    main()
  File "/Users/eggie5/Development/wanda/main.py", line 65, in main
    prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
  File "/Users/eggie5/Development/wanda/lib/prune.py", line 132, in prune_wanda
    dataloader, _ = get_loaders("c4",nsamples=args.nsamples,seed=args.seed,seqlen=2048,tokenizer=tokenizer)
  File "/Users/eggie5/Development/wanda/lib/data.py", line 73, in get_loaders
    return get_c4(nsamples, seed, seqlen, tokenizer)
  File "/Users/eggie5/Development/wanda/lib/data.py", line 55, in get_c4
    i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
  File "/Users/eggie5/.pyenv/versions/3.10.0/lib/python3.10/random.py", line 370, in randint
    return self.randrange(a, b+1)
  File "/Users/eggie5/.pyenv/versions/3.10.0/lib/python3.10/random.py", line 353, in randrange
    raise ValueError("empty range for randrange() (%d, %d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (0, 0, 0)

How to fine tune unstructured sparse models with LoRA?

It's a simple but great work! As the question says, I don't quite understand how to fine-tune the model with LoRA after unstructured pruning. Is it fine-tuned in a form similar to W = W + m*(AB), where m is the mask matrix of shape (C_out, C_in)? My main concern is that the additional weights of LoRA will significantly reduce the sparsity of the model. If I've misunderstood anything, I look forward to your corrections!

Pruning image classification models (mlp_mixer)

Thank you for your project. I noticed the presence of the mlp_mixer models in the project. I am curious if sparsification has been implemented on this model or on other image classification models.

pruned model load slowly

Thank you for your work! When I save the pruned model, it loads much more slowly than the original model when I use it. What's going on here?

Ambiguous result for LLAMA-2-13b

Using the original code, prune_wanda with 50% sparsity gives a ppl of 50000+,
while for the original llama-2-13b, the ppl is a normal 4.5+.
Before that, the code performed well for llama-7b and llama-2-7b; all the ppl results matched the paper.
Are there any additional arguments that should be set for the 13b model?

To avoid my poor English getting in the way, and having read the author's background, I also wrote this in Chinese (translated here):
I used the code from this repository; it behaves normally on the 7B llama and llama2 models, but abnormally on the 13B llama2 model.
I have confirmed repeatedly that:
the ppl evaluation and the 13B model itself are both fine, and the unpruned llama2 shows normal ppl;
the pruning code is copied directly from the author's, and I found no problems when checking it myself;
all the parameters I used are essentially the defaults in the author's code.
Has the author tested the code on the 13B model? Are some special parameter changes needed?

AttributeError: 'NoneType' object has no attribute 'to'

I am trying to prune with
python main.py \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save out/mistral_7b/unstructured/wanda/
and the output is as below.
torch 2.3.0
transformers 4.41.0.dev0
accelerate 0.31.0.dev0

# of gpus:  2

loading llm model mistralai/Mistral-7B-Instruct-v0.2
^MLoading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]^MLoading checkpoint shards: 33%|███▎ | 1/3 [00:30<01:01, 30.79s/it]^MLoading checkpoint shards: 67%|██████▋ | 2/3 [00:46<00:21,$
use device cuda:0
pruning starts
loading calibdation data
dataset loading complete
Traceback (most recent call last):
File "/mnt/parscratch/users/acq22stk/teamproject/wanda/main.py", line 110, in
main()
File "/mnt/parscratch/users/acq22stk/teamproject/wanda/main.py", line 69, in main
prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
File "/mnt/parscratch/users/acq22stk/teamproject/wanda/lib/prune.py", line 144, in prune_wanda
inps, outs, attention_mask, position_ids = inps.to(dev), outs.to(dev), attention_mask.to(dev), position_ids.to(dev)
AttributeError: 'NoneType' object has no attribute 'to'

Cannot reproduce Llama2 results

Hello, I'm opening this issue because I'm still having problems with reproducing the llama 2-7b results (both without pruning and using wanda). Here are my intermediate and final perplexity results with the dense model (with context size 4096). It seems like the last few samples are somehow messing up the perplexity but I don't know why. Any help would be appreciated.
nsamples 333
sample 50, Perplexity 5.0264153480529785
sample 100, Perplexity 5.311441421508789
sample 150, Perplexity 5.710564136505127
sample 200, Perplexity 5.612466335296631
sample 250, Perplexity 5.526543617248535
sample 300, Perplexity 6.8109965324401855
wikitext perplexity 7.72459077835083

HellaSwag numbers?

Great work on this project!
In Table 20 of the LLAMA-2 paper, it says that LLAMA-2 gets 77.2 accuracy on HellaSwag. The LLAMA-2 paper isn't clear on whether this is zero-shot, but Table 20 of the Falcon paper confirms that it is zero-shot. However, in Table 25 of the Wanda paper, it says that LLAMA-2 Dense gets 57.17 accuracy on HellaSwag.

This seems like a large gap. Could you help me to understand the gap? E.g. are there multiple metrics, or multiple versions of the dataset, or something else that could cause a gap like this?

is it possible to prune gptq models?

def get_llm(model, cache_dir="llm_weights"):
    model = AutoModelForCausalLM.from_pretrained(
        model, 
        torch_dtype=torch.float16, 
        cache_dir=cache_dir, 
        low_cpu_mem_usage=True, 
        device_map="auto"
    )

    model.seqlen = 2048
    return model

I am interested if it is possible to prune 4-bit gptq models, also maybe of different sequence length?

Change sparsity rates

I used unstructured pruning with a sparsity rate of 0.8 instead of 0.5 for llama2-7B, but the result was not good, even worse than at 0.5. What could be causing this? Why did you only consider a sparsity rate of 0.5 in your experiments? Thanks a lot.

How can the pruned model with sparse matrix save model size and computation cost?

Thank you so much for sharing this work and code!
But from my understanding, Wanda does not actually decrease the number of parameters of the model; instead, it simply sets a lot of parameters to 0.

  1. In this way, the total number of parameters remains the same as in the original model, so the model size will be the same. I checked this by printing out the #params of the pruned model and the original model, and they look identical.

  2. Although a lot of parameters are set to 0, if no special optimization is designed for the forward pass, the zeros are still involved in the multiplication and addition operations, so the computation cost will be the same.
    Am I understanding this right?

calibration data seq_length

Hi, I have a question about the calibration data (128 samples of 2048 tokens each).

Is there a particular reason to use 2048 tokens for each sample?
I looked through SparseGPT and GPTQ, but I couldn't find an explanation.
I hope I can get some insight.

Thank you.

Request for Code Related to Zero-shot Task Evaluation Results in Table 3

Hi,

I'm specifically interested in the zero-shot task evaluation results that are detailed in Table 3. However, I noticed that the corresponding code for this specific section doesn't seem to be available in the repository.

To further understand and potentially build upon this work, I kindly request you to consider sharing the related code or guide me to where I might find it, if it's located elsewhere.

Thank you for your time and consideration, I really appreciate the effort you put into the research and making this codebase available to the public.
