
laser's Introduction

Layer-Selective Rank Reduction

This repository contains code for the paper "The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction," by Pratyusha Sharma, Jordan T. Ash, and Dipendra Misra (ICLR 2024).

Website: https://pratyushasharma.github.io/laser/

Updates:

  • Jan 18th, 2024: Refactoring is happening in the refactor branch. We are working to release it quickly and thank you for your patience.
  • Jan 7th, 2024: Results table has been created on the website.
  • Jan 4th, 2024: Discussions page is open. Feel free to use it to suggest new topics/ideas/results that are not covered by issues.

This is an early development release. We will do a major refactor in Jan 2024 to make the code easier to use and more flexible.

We welcome issues and pull requests. If you report a new result using LASER on a given LLM and NLP task, please issue a pull request and we'll add it to the website's leaderboard.

What is Layer-Selective Rank Reduction?

LAyer-SElective Rank-Reduction, abbreviated LASER, is an intervention that replaces a selected weight matrix in the transformer architecture of an LLM with its low-rank approximation. A single LASER transformation is described by three hyperparameters: the layer number to modify (ℓ), e.g., the 16th layer; the parameter type (τ), e.g., the first MLP layer; and the fraction of the maximum rank to retain (ρ), e.g., 0.01 of the rank. We write this transformation as (ℓ, τ, ρ); these transformations can be composed and applied in parallel. The low-rank approximation is computed with SVD. The figure below, taken from our paper, shows an illustration.

LASER illustration

LASER can give significant performance improvements on question-answering tasks without additional model training. Our paper presents various results from evaluating LASER on three different LLMs and several LLM benchmarks. This repository contains the code to reproduce these results.
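To make the intervention concrete, here is a minimal sketch (not the repository's actual implementation) of applying a single LASER transformation to one weight matrix with PyTorch: compute a truncated SVD and replace the matrix with its rank-reduced reconstruction. The helper name low_rank_approx, the fraction rho, and the module path in the commented usage are illustrative assumptions.

import torch

def low_rank_approx(weight: torch.Tensor, rho: float) -> torch.Tensor:
    # Keep only a rho fraction of the maximum possible rank.
    max_rank = min(weight.shape)
    desired_rank = max(1, int(rho * max_rank))
    # Truncated SVD: weight ≈ U @ diag(S) @ V^T with the top desired_rank components.
    U, S, V = torch.svd_lowrank(weight, q=desired_rank)
    return U @ torch.diag(S) @ V.t()

# Illustrative usage for (ℓ=26, τ=fc_in, ρ=0.01) on a GPT-J-style module layout:
# layer = model.transformer.h[26].mlp.fc_in
# layer.weight.data = low_rank_approx(layer.weight.data, rho=0.01)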

How to run a sample experiment

We first discuss installing the code and then discuss how to run an experiment.

Installation

To install the dependencies, install the packages listed in requirements.txt. We chiefly need PyTorch and the datasets and transformers packages from Hugging Face. It might be a good idea to create a conda environment first.

pip3 install -r requirements.txt

Optionally, if you want to experiment with the CounterFact dataset then run the following script to download it. All other datasets are available on HuggingFace.

python scripts/get_counterfact.py

Run a sample experiment

At the moment, each setup is its own file. To run an experiment that applies a single LASER transformation to GPT-J on the Fever dataset, you can run:

python3 intervention_gptj_fever.py --lname fc_in --rate 9.9 --lnum 26

Here lnum is ℓ, lname is τ, and rate is related to ρ by ρ = 1 - 0.1 * rate. The rate takes values in [0, 10.0] and measures how many components are thrown away: 10 means all components are thrown away (leaving a zero matrix) and 0 means all components are retained (leaving the original matrix). The use of rate is for legacy reasons, and we will refactor the code to use ρ directly in the future. A short sketch of this mapping appears after the table below. The mapping for lname that we use is:

lname      description
dont       use the base model and do not perform any intervention
fc_in      first layer of the MLP
fc_out     second layer of the MLP
fc_up      a third MLP weight matrix present in some LLMs, used for Hadamard multiplication
mlp        all MLP weight matrices {fc_in, fc_up, fc_out}
k_proj     key matrix in self-attention
v_proj     value matrix in self-attention
q_proj     query matrix in self-attention
out_proj   output matrix in self-attention
attn       all attention weight matrices
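As a concrete illustration of the rate-to-ρ mapping described above, here is a minimal sketch (not the repository's actual code; the helper name rate_to_rank is illustrative) of how a command-line rate translates into the number of retained singular components:

def rate_to_rank(rate: float, max_rank: int) -> int:
    # rho = 1 - 0.1 * rate is the fraction of singular components to keep.
    rho = 1.0 - 0.1 * rate
    return max(1, int(rho * max_rank))

# Example: --rate 9.9 on a matrix with maximum rank 4096 keeps about 1% of the components.
print(rate_to_rank(9.9, 4096))  # -> 40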

Please note that if you add a new LLM, you have to adapt the laser package to implement the mappings for it. For example, see the mappings for Llama2 here. You also need to update the Laser wrapper to work with the new LLM here.

Note that the above experiments save accuracies and log-losses for each datapoint. In some files, one has to take the validation set (the first 20% of examples), do hyperparameter selection on it separately, and then compute the accuracy on the test set (the remaining 80% of examples) with the chosen hyperparameters. In the future, we will refactor the code to make this easy to do. A small sketch of this selection step follows.
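Since the experiment scripts save per-datapoint accuracies and log-losses, the selection described above can be scripted offline. Below is a minimal sketch under the assumption that, for each hyperparameter setting, you have a list of 0/1 correctness flags in datapoint order (this format and the function name are hypothetical, not the repository's logging format):

def select_and_evaluate(results_by_config):
    # results_by_config: {(lnum, lname, rate): [0/1 correctness flag per datapoint]}
    best_config, best_val_acc = None, -1.0
    for config, flags in results_by_config.items():
        n_val = int(0.2 * len(flags))            # first 20% of examples: validation
        val_acc = sum(flags[:n_val]) / max(1, n_val)
        if val_acc > best_val_acc:
            best_config, best_val_acc = config, val_acc
    flags = results_by_config[best_config]
    n_val = int(0.2 * len(flags))
    test_acc = sum(flags[n_val:]) / max(1, len(flags) - n_val)   # remaining 80%: test
    return best_config, test_acc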

Code Organization

The code is inside the src folder. The main experiment files live at the top level of src. The filename convention is intervention_<llm-name>_<dataset-name>.py, where <llm-name> is the name of the LLM and <dataset-name> is the name of the dataset. For BigBench, the dataset split is often specified with an additional --split flag. Please see the codebase for details of the command line arguments. We will provide a comprehensive tutorial later.

The code for performing LASER is inside the laser package. We use PyTorch to compute the SVD and the low-rank approximation; the low-rank approximation code is here. The code for reading and processing datasets is inside dataset_util. Finally, metrics and logging are handled by study_utils.

Citation

If you find this codebase useful, then please cite the following paper. Additionally, feel free to send a PR or an email and we will cite your result/paper on the leaderboard.

@article{sharma2023truth,
  title={The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction},
  author={Sharma, Pratyusha and Ash, Jordan T and Misra, Dipendra},
  journal={arXiv preprint arXiv:2312.13558},
  year={2023}
}

laser's People

Contributors

dkmisra, jordanash, pratyushasharma, sidhantls


laser's Issues

Where to Get the Dataset

Hi,
Thank you so much for making this project! I see that there's a CLI argument for dataset_file; do you know what I should point it to for the CounterFact experiments?
Thank you!

What does the 'rate' parameter actually mean in the code?

The GitHub README says 'rate' measures how much rank to retain, but the code's results = torch.svd_lowrank(weight, q=desired_rank, niter=niter) confuses me a bit. desired_rank is the rank to retain, and desired_rank = max_rank * k, so k is the fraction to retain; then k = (10 - rate) * 0.1, so rate should mean how much to reduce? Is there something wrong with my understanding?

Question

Hi,
Thanks for releasing this code. Does this codebase decrease the size of the model (i.e., file size and required VRAM)?
Thank you!

Generic model?

Thanks for publishing this excellent work. If I understand correctly, you run LASER intervention separately for each evaluation task.

Would it be possible to make one LASER model that is generic across all tasks? My goal is to compress LLAMA-v2-7B so that it is smaller and executes faster on mobile devices.

Also, is it correct that you apply LASER to just one layer of the model? I was wondering whether you tried applying it to most of the layers.

Problem Encountered During Reproduction

Thank you for sharing your great work. While reproducing your project, I encountered an issue that I hope you can provide some help with. I executed the following commands:
python3 intervention_gptj_fever.py --lname fc_in --rate 9.9 --lnum 26
python3 intervention_gptj_fever.py --lname fc_in --rate 9.0 --lnum 26
python3 intervention_gptj_fever.py --lname dont
However, I noticed that LASER did not yield the expected performance improvement. I've attached the log files for your reference. Could you kindly look into them and help me identify any potential problems in my setup?
GPTJ-log-26-fc_in-9.0.txt
GPTJ-log-24-dont-1.txt
GPTJ-log-26-fc_in-9.9.txt

method of composing reductions across layers

Hello! Thanks for your idea and code; I am applying it to my model. I have two questions:

  1. The paper says you "greedily search" over the parameters and use a "simple compose strategy" when composing reductions across layers. Does this mean searching for the best rate in different later MLP layers and then simply composing them?
  2. Can I use a single command line to compose reductions across layers, or do I need to repeatedly run the intervention on one layer at a time to compose the reductions?

Thank you!

Mistral Support

Hi,
Great work on this! Is Mistral supported? Right now I only see GPT-J and Llama 2.
Thank you!

Excellent work, looking forward to following up with further research!

I have a few questions about Sections 5.1 and 5.2:

  1. For the CounterFact dataset, we should not only compare top-k accuracy but also consider the ES and EM metrics mentioned by Meng et al., since the intervention may increase the probability of wrong words while also increasing the probability of correct words (e.g., "The native language of Danielle Darrieux is English" vs. "Danielle Darrieux's native language is French").
  2. I think that if a specific LASER is tuned on each dataset individually, it will significantly improve prediction performance but risks overfitting. A uniform set of rank-reduction hyperparameters should be found to demonstrate the effectiveness of the approach.
  3. Nevertheless, I think this is a very worthwhile endeavor, and it gives us valuable insight into the inner workings of the transformer.

Reference:
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in
GPT. Advances in Neural Information Processing Systems, 36, 2022.

Feature Request for Upcoming Refactoring

This is the issue that contains the list of all requested features for the upcoming refactoring:

  1. A unified abstract class that does all the common work: create command line arguments, build an LLM, and run the experiment. We may end up with only one file per LLM (or per LLM type) plus this abstract class. We may not be able to get it down to a single file, since certain LLMs, like RoBERTa, which are really masked language models, have a different procedure for computing accuracy and log-loss from the tokens.

  2. Replace the use of rate with ρ, which is what the paper uses.

  3. Add a feature to reduce memory by storing the separate U, S, V matrices rather than multiplying them back together and losing the memory advantage (see the sketch after this list).

  4. Add more LLMs, specifically Mistral, other Llama2 versions, and the Phi models.

  5. Release LLMs with optimally chosen reductions from Table 3 of the paper https://arxiv.org/pdf/2312.13558.pdf.

If you have more requests, please post them below. Do note that the first version of the refactoring may not be able to do all of the above, but we'll do our best. We welcome PRs.
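For item 3 above, a minimal sketch of the idea (illustrative only, not the planned implementation): keep the truncated SVD factors and apply them in sequence instead of multiplying them back into a dense matrix.

import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    # Stores U, S, V from a truncated SVD of an nn.Linear weight, so a d_out x d_in
    # matrix costs O((d_out + d_in) * r) parameters instead of O(d_out * d_in) when r is small.
    def __init__(self, linear: nn.Linear, desired_rank: int):
        super().__init__()
        with torch.no_grad():
            U, S, V = torch.svd_lowrank(linear.weight, q=desired_rank)
        self.U = nn.Parameter(U)   # (out_features, r)
        self.S = nn.Parameter(S)   # (r,)
        self.V = nn.Parameter(V)   # (in_features, r)
        self.bias = linear.bias

    def forward(self, x):
        # y = x @ W^T with W ≈ U diag(S) V^T, applied factor by factor.
        y = (x @ self.V) * self.S
        y = y @ self.U.t()
        return y if self.bias is None else y + self.bias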

Potential improvements for evaluation

Thanks for providing the code for this promising research. I'm looking forward to seeing how far this idea can be pushed. It would be especially cool if a heuristic could be found that applies this technique to multiple layers chosen in a way that works across different models.

When I investigated the results a bit more closely and ran some of the benchmarks locally, I came across some potential issues. Specifically, I took a look at the BigBench Epistemic Reasoning benchmark, but I suspect that others could also be affected. First of all, I noticed that the accuracy of the models without intervention was below 50% (Tab. 1). For a binary classification task, this is strange. When debugging the results, I found that for Roberta and GPT-J (I haven't tested Llama), the models always predicted the same label, and since that label appears in 37% of the samples in the dataset, that is also their accuracy. As Llama has 63% accuracy with intervention, I suspect that it simply always predicts the other label.

Digging a bit deeper, I found the logits for the label tokens to be extremely small. This typically happens when the model is somehow "derailed" and wants to predict neither of the tokens. Sometimes this simply comes down to tokenization: often the models try to predict " True" and " False" (with a leading whitespace) because this is how they tokenize the text. Other times, they want to go in a completely different direction. I would recommend logging the absolute probabilities of the label tokens and double-checking when they are too low (see the sketch below). Often, this can be fixed by slight adjustments to the prompts or labels.
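A hedged sketch of the kind of check suggested above, assuming a causal LM and tokenizer from the transformers library (the prompt, label strings, and function name are illustrative):

import torch

def label_token_probs(model, tokenizer, prompt, labels=(" True", " False")):
    # Report the absolute next-token probability of each candidate label;
    # note the leading whitespace, which matters for many tokenizers.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    result = {}
    for label in labels:
        token_ids = tokenizer(label, add_special_tokens=False)["input_ids"]
        result[label] = probs[token_ids[0]].item()   # probability of the label's first token
    return result

# If both probabilities are tiny, the model is likely predicting neither label,
# and the prompt or label strings may need adjustment.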

Also, there is a typo in this prompt: "entails" => "entail".

I hope this is helpful.

Llama2-7B + TruthfulQA reproduce issue

Hello @pratyushasharma. Thanks for your effort and the code. I have been reproducing the Llama2-7B + TruthfulQA result based on your code so that I can use your work as a baseline for further research, but I found that the results (i.e., accuracy) were almost the same, around 56.52, especially for the base model. I do not know what is wrong, and I am still confused about what causes such a large accuracy increase for Llama2-7B + TruthfulQA (around 5.7% in your results). I would appreciate it if you could help me check this result.

License

Hi,
Great work on this codebase! Would you mind adding a license, e.g., MIT/Apache/ISC?
Thank you!
