
r-tuning's Introduction

R-Tuning: Teaching Large Language Models to Say 'I Don't Know'

🏆 R-Tuning received the Outstanding Paper Award at NAACL 2024. 🎉

This is the official repo for the NAACL 2024 paper R-Tuning: Instructing Large Language Models to Say 'I Don't Know'.

Introduction

A predominant issue with large language models (LLMs) is their propensity to generate non-existent facts, a concern termed hallucination. Our research is motivated by the observation that previous instruction tuning methods force the model to complete a sentence regardless of whether it possesses the relevant knowledge. When a question falls outside its parametric knowledge, the model tends to make something up and fails to indicate that it lacks the knowledge. We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning). The approach first identifies the disparity between the knowledge encoded in the pre-trained parameters and the knowledge covered by the instruction tuning data. It then constructs refusal-aware data based on this knowledge intersection and tunes the LLM to refrain from responding to questions beyond its parametric knowledge. Experimental results demonstrate that R-Tuning effectively improves a model's ability to answer known questions and to refrain from answering unknown questions. Furthermore, when tested on out-of-domain datasets, the refusal ability proves to be a meta-skill that generalizes to other tasks. Further analysis surprisingly finds that learning uncertainty results in better calibration and a better ability to estimate uncertainty than uncertainty-based testing.

The illustrations are shown below.

[Method illustrations from the paper]

Getting Started

git clone https://github.com/hanningzhang/R-Tuning-code.git
cd R-Tuning-code

Dataset Download

Please download the datasets from the link below and place the folder inside the R-Tuning-code/dataset folder.

https://drive.google.com/drive/folders/17v7IbnAPXX1NQpqjlDMhhxFK0cuNYSd6?usp=sharing

Requirements

git clone -b v0.0.5 https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
bash install.sh
cd ..

The LMFlow environment contains all the packages needed.

Constructing Training Datasets

Here we provide the five training datasets we use. Please change to the corresponding directory and run the scripts to construct the refusal-aware datasets.

We provide an example of constructing the ParaRel dataset with the open_llama_3b model; the other datasets follow exactly the same procedure.

cd training
cd pararel
python run_pararel.py \
--model openlm-research/open_llama_3b \
--method unsure

The constructed datasets will be stored in a new directory, training_data.
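
For reference, the idea behind the unsure method is roughly: query the base model on each training question, compare its prediction against the ground truth, and append an expression of certainty or uncertainty accordingly. The snippet below is a minimal sketch of that logic, not the repository's exact code; how the model's answer is obtained (the prompt template and decoding) lives in the run_*.py scripts.

# Minimal sketch of refusal-aware data construction (the "unsure" method).
# The certainty suffix matches the prompt shown in the paper's examples;
# obtaining model_answer (prompting + decoding) is handled by run_pararel.py.
SUFFIX = " Are you sure you accurately answered the question based on your internal knowledge?"

def build_instance(question, gold_answer, model_answer):
    # If the base model already answers correctly, the question lies within
    # its parametric knowledge; otherwise the instance is labeled as unsure.
    certainty = "I am sure." if model_answer.strip() == gold_answer.strip() else "I am unsure."
    return f"{question} {gold_answer}.{SUFFIX} {certainty}"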

Fine-tuning with LMFlow

Here is an example of fine-tuning an open_llama_3b base model.

Please feel free to replace --model_name_or_path with other Hugging Face models.

cd ~/LMFlow
./scripts/run_finetune.sh \
  --model_name_or_path openlm-research/open_llama_3b \
  --dataset_path ../training/training_data \
  --output_model_path output_models/finetuned_llama_3b
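
As a quick sanity check, you can load the fine-tuned model with Hugging Face Transformers and ask it a question; an R-Tuned model appends an expression of (un)certainty after its answer. A minimal sketch, assuming the output path from the command above:

# Load the fine-tuned model and inspect its refusal behaviour.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "output_models/finetuned_llama_3b"  # path produced by the command above
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

prompt = "Question: What is the capital of France? Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))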

Evaluation

Here is an example of evaluating the open_llama_3b model on the ParaRel dataset.

Please replace --model with any R-Tuning model.

cd ~/evaluation/pararel
python evaluate.py \
--model openlm-research/open_llama_3b \
--domain ID \
--result ParaRel_openllama_3b

After obtaining the results, please run the following command to compute the Average Precision (AP) score:

cd results
python calculate_ap.py --result ParaRel_openllama_3b.json
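
Conceptually, the AP score ranks the model's answers by how confident it is in "sure" versus "unsure" and measures whether correct answers receive higher confidence. A minimal sketch of that metric using scikit-learn; calculate_ap.py may differ in the details:

# Illustration of the AP metric: rank answers by the model's P("sure")
# and check whether correctness follows that ranking.
from sklearn.metrics import average_precision_score

# correctness[i] = 1 if answer i is correct; confidence[i] = model's P("sure") for answer i.
correctness = [1, 0, 1, 1, 0]
confidence = [0.92, 0.35, 0.80, 0.66, 0.41]

print(f"AP score: {average_precision_score(correctness, confidence):.3f}")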

Citation

If you use or extend our work, please cite the following paper:

@inproceedings{zhang-etal-2024-r,
    title = "{R}-Tuning: Instructing Large Language Models to Say {`}{I} Don{'}t Know{'}",
    author = "Zhang, Hanning  and
      Diao, Shizhe  and
      Lin, Yong  and
      Fung, Yi  and
      Lian, Qing  and
      Wang, Xingyao  and
      Chen, Yangyi  and
      Ji, Heng  and
      Zhang, Tong",
    editor = "Duh, Kevin  and
      Gomez, Helena  and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.394",
    pages = "7113--7139",
}


r-tuning's Issues

question about MMLU training sample

Thanks for your promising work!
I have a question regarding the MMLU training data. Upon running MMLU/run_MMLU.py, I obtained the MMLU training data where each instance looks like:
The following are multiple choice questions (with answers) about college physics.\n\nA refracting telescope consists of two converging lenses separated by 100 cm. The eye-piece lens has a focal length of 20 cm. The angular magnification of the telescope is\nA. 4\nB. 5\nC. 6\nD. 20\nAnswer:A\n\nFor which of the following thermodynamic processes is the increase in the internal energy of an ideal gas equal to the heat added to the gas?\nA. Constant temperature\nB. Constant volume\nC. Constant pressure\nD. Adiabatic\nAnswer:B\n\nOne end of a Nichrome wire of length 2L and cross-sectional area A is attached to an end of another Nichrome wire of length L and cross- sectional area 2A. If the free end of the longer wire is at an electric potential of 8.0 volts, and the free end of the shorter wire is at an electric potential of 1.0 volt, the potential at the junction of the two wires is most nearly equal to\nA. 2.4 V\nB. 3.3 V\nC. 4.5 V\nD. 5.7 V\nAnswer:A\n\nA refracting telescope consists of two converging lenses separated by 100 cm. The eye-piece lens has a focal length of 20 cm. The angular magnification of the telescope is\nA. 4\nB. 5\nC. 6\nD. 20\nAnswer:A\n\nThe muon decays with a characteristic lifetime of about 10^-6 second into an electron, a muon neutrino, and an electron antineutrino. The muon is forbidden from decaying into an electron and just a single neutrino by the law of conservation of\nA. charge\nB. mass\nC. energy and momentum\nD. lepton number\nAnswer:D\n\nTwo spaceships approach Earth with equal speeds, as measured by an observer on Earth, but from opposite directions. A meterstick on one spaceship is measured to be 60 cm long by an occupant of the other spaceship. What is the speed of each spaceship, as measured by the observer on Earth?\nA. 0.4c\nB. 0.5c\nC. 0.6c\nD. 0.7c\nAnswer:B. Are you sure you accurately answered the question based on your internal knowledge? I am unsure.

I'm curious if the entire sentence is taken into account when calculating the loss. The paper mentions "Note that we calculate the loss solely for the answer part and the uncertainty part, while excluding the loss attributed to the question part." However, I'm having trouble locating the code that masks the sentence.

Thanks so much for your help!
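
For context, the masking described in that quote is typically implemented by setting the label ids of the prompt tokens to -100, which the causal-LM cross-entropy loss ignores. Below is a minimal sketch of that idea, assuming a Hugging Face tokenizer; it is not the repository's exact code.

def build_inputs_and_labels(tokenizer, question, answer_and_uncertainty):
    # Tokenize the prompt and the target separately so the boundary is known.
    prompt_ids = tokenizer(question, add_special_tokens=False).input_ids
    target_ids = tokenizer(answer_and_uncertainty, add_special_tokens=False).input_ids
    input_ids = prompt_ids + target_ids
    # -100 labels are ignored by the loss, so only the answer and the
    # uncertainty expression contribute to training.
    labels = [-100] * len(prompt_ids) + target_ids
    return input_ids, labels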

Some Questions about MMLU Dataset

Hi,
Thank you for your great work!
In your paper, you conducted experiments on the MMLU dataset, and it seems that you performed R-Tuning on it.
I found an MMLU dataset on Hugging Face (https://huggingface.co/datasets/lukaemon/mmlu) and saw that there are only 4 training samples for each category. I just want to confirm: did you use only 4 samples per category for fine-tuning, or did you re-split the dataset and use more samples? I am concerned that 4 samples might be too few.

Thank you for your time and patience. Looking forward to your reply.

IndexError when running evaluation

Hi. I tried to follow the README example, and when I came to the Evaluation step, an IndexError occurred when I ran python evaluate.py --model openlm-research/open_llama_3b --domain ID --result ParaRel_openllama_3b:
[Screenshot of the IndexError traceback]

I think the error occurs because, in evaluate.py, SURE and UNSURE store the full tokenizer encodings rather than token ids:

SURE.append(tokenizer("sure")[0])
UNSURE.append(tokenizer("unsure")[0])
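
One possible fix consistent with that diagnosis is to store token ids instead of the encoding objects, e.g.:

# Store token ids instead of encoding objects; the exact index to keep
# depends on how the tokenizer splits the word and whether it prepends
# special tokens such as BOS.
SURE.append(tokenizer("sure", add_special_tokens=False).input_ids[0])
UNSURE.append(tokenizer("unsure", add_special_tokens=False).input_ids[0])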

About Dataset "NEC"

Hi there!

Thank you for your excellent work!

I am interested in the open-domain dataset NEC, but I have been unable to find any relevant information about it.

Could you help me, for example by providing relevant papers or dataset links? Thanks a lot!

Training configuration and hardware spec

Hi!
Congratulations on a very interesting work, and thank you for releasing the code :)

I am running some experiments and would like to reproduce some results.
I had some questions regarding the training configurations.

  1. From reading the instructions, I assume you did full fine-tuning. Could you confirm this?

  2. When training the 7B model using LMFlow, I run into CPU OOM on a server with 220 GB of RAM. I believe this is abnormal and may be a problem on my side. If you recall how much CPU memory was required, could you tell me?

  3. Which LLaMA weights did you use? If you used the ones on Hugging Face, could you tell me the repo id?

Thanks.

The accuracy on FalseQA

I had the pleasure of reading your paper and found it very interesting, but I have a question. Section 5.4 of the paper uses rejection-rate metrics, while the FalseQA dataset officially uses Precision, Accuracy, Recall, etc. May I ask how R-Tuning performs on these metrics?
