[ACL2023] We introduce LLM-Blender, an innovative ensembling framework to attain consistently superior performance by leveraging the diverse strengths of multiple open-source LLMs. LLM-Blender cuts the weaknesses through ranking and integrates the strengths through fusing generations to enhance the capability of LLMs.

Home Page: https://yuchenlin.xyz/LLM-Blender/

License: Apache License 2.0

Shell 6.50% Python 87.25% Jupyter Notebook 6.25%
ensemble-methods large-language-models llm natural-language-processing

llm-blender's Introduction

LLM-Blender: Ensembling LLMs with Pairwise Ranking & Generative Fusion [ACL2023]

   

Authors: Dongfu Jiang, Xiang Ren, Bill Yuchen Lin @ AI2-Mosaic and USC-INK

🔥News

  • [2024/1/5] PairRM can now be loaded directly with the Hugging Face wrapper DebertaV2PairRM.from_pretrained("llm-blender/PairRM-hf"); see more on our 🤗 Model page

  • [2023/11/10] Glad to announce that our pairwise reward model, 🤗 PairRM, has been released. It is trained on high-quality, large-scale human preference data and approaches GPT-4's alignment with human preferences with an extremely small model size (0.4B).

  • [2023/10/24] The pre-trained PairRanker can now be loaded directly from 🤗 Hugging Face Models as llm-blender/PairRM within 3 lines of code. See Guidance for Rank & Fusion for details.

Overview

LLM-Blender framework overview (figure)

Abstract
  • We introduce LLM-Blender, an innovative ensembling framework to attain consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs). LLM-Blender cuts the weaknesses through ranking and integrates the strengths through fusing generations to enhance the capability of LLMs.

  • Our framework consists of two complementary modules: PairRanker and GenFuser, addressing the observation that optimal LLMs for different examples can significantly vary. PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs. GenFuser aims to merge the top-ranked candidates from the aggregation of PairRanker's pairwise comparisons into an improved output by capitalizing on their strengths and mitigating their weaknesses.

  • To facilitate large-scale evaluation, we introduce a benchmark dataset, MixInstruct, which is a mixture of multiple instruction datasets featuring oracle pairwise comparisons for testing purposes. Our LLM-Blender significantly surpasses the best LLMs and baseline ensembling methods across various metrics on MixInstruct, establishing a substantial performance gap.

Usage

Installation

pip install llm-blender
# pip install git+https://github.com/yuchenlin/LLM-Blender.git

Then you are ready to use LLM-Blender with import llm_blender.

For development, you can clone the repo and install it locally.

git clone https://github.com/yuchenlin/LLM-Blender.git
cd LLM-Blender
pip install -e .

Use case 1: (Re-)Ranking model outputs by pairwise comparisons

import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
  • Then you can rank with the following function
inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"], 
                    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=1)
# ranks is a list of ranks where ranks[i][j] represents the rank of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2], # "hi! I am fine, thanks!" ranks 1st, "bye!" ranks 2nd, and "get out!" ranks 3rd. 
       [1, 3, 2]], # "I love you too!" ranks 1st, and "I hate you!" ranks 3rd.
       dtype=int32) 

"""
  • Using llm-blender to directly compare two candidates
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# comparison_results[0]--> True 
  • You can also fuse the top-ranked candidates with the following code
blender.loadfuser("llm-blender/gen_fuser_3b") # load fuser checkpoint if you want to use pre-trained fuser; or you can use ranker only
from llm_blender.blender.blender_utils import get_topk_candidates_from_ranks
topk_candidates = get_topk_candidates_from_ranks(ranks, candidates_texts, top_k=3)
fuse_generations = blender.fuse(inputs, topk_candidates, batch_size=2)
# fuse_generations are the fused generations from our fine-tuned checkpoint

# You can also do the rank and fusion with a single function

fuse_generations, ranks = blender.rank_and_fuse(inputs, candidates_texts, return_scores=False, batch_size=2, top_k=3)
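
  • If you only need the single best candidate per input (pure re-ranking, no fusion), a minimal sketch continuing from the ranking example above reuses the same helper with top_k=1; variable names follow the snippet above:

from llm_blender.blender.blender_utils import get_topk_candidates_from_ranks

# continuing from the ranking example above (inputs, candidates_texts, ranks)
top1_candidates = get_topk_candidates_from_ranks(ranks, candidates_texts, top_k=1)
best_per_input = [cands[0] for cands in top1_candidates]  # best candidate text for each input
# given the example ranks above, this should select "hi! I am fine, thanks!" and "I love you too!"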

Use case 2: Best-of-N Sampling (Re-ranking)

Best-of-n sampling, a.k.a. rejection sampling, is a strategy to enhance response quality by selecting the response ranked highest by the reward model (learn more at OpenAI WebGPT Section 3.2 and the OpenAI Blog).

Best-of-n sampling is an easy way to improve your LLMs by sampling and re-ranking with just a few lines of code. An example of applying it to Zephyr-7b follows.

import llm_blender
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
system_message = {"role": "system", "content": "You are a friendly chatbot."}

inputs = ["can you tell me a joke about OpenAI?"]
messages = [[system_message, {"role": "user", "content": _input}] for _input in inputs]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]

# standard sampling generation 
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids
sampled_outputs = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
print(tokenizer.decode(sampled_outputs[0][len(input_ids[0]):], skip_special_tokens=False))
# --> `Sure` 

# using our PairRM for best-of-n sampling
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)

print("### Prompt:")
print(prompts[0])
print("### best-of-n generations:")
print(outputs[0])
# --> 
""" 
Sure, here's a joke about OpenAI:

Why did OpenAI decide to hire a mime as their new AI researcher?

Because they wanted someone who could communicate complex ideas without making a sound!

(Note: This is a joke, not a reflection of OpenAI's actual hiring practices.)
"""

Use case 3: Used as a local Pairwise Evaluator and for better RLHF

Our latest 🤗 PairRM, which has been further trained on various high-quality, large-scale datasets with human preference annotations, shows strong correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4. (See the detailed comparison on the 🤗 PairRM page.)

To get scalar rewards, you can use blender.rank_with_ref method (see the example below). This method compares all the candidates with the reference and returns the relative scalar rewards.

import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint

inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"], 
    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
rewards = blender.rank_with_ref(inputs, candidates_texts, return_scores=True, batch_size=2, mode="longest")
print("Rewards for input 1:", rewards[0]) # rewards of candidates for input 1
"""
rewards is a List[List[float]] of shape (len(inputs), len(candidates_texts[0])),
representing the rewards of each candidate for each input.
By default, the rewards are calculated based on the comparison with the longest generation as the reference (mode="longest").
Other supported modes are "shortest", "median_length", "first", and "last".
"""

You can also pass a list of references to compare with, instead of automatically selecting one from the candidates as the fixed reference.

ref_candidates = [_c[0] for _c in candidates_texts] # use the first candidate as the reference, same as mode="first"
rewards = blender.rank_with_ref(inputs, candidates_texts, return_scores=True, batch_size=2, ref_candidates=ref_candidates) 
"""
ref_candidates = [ref1, ref2, ref3, ...] # ref_candidates is a List[str], shape (len(inputs),)
this parameter overrides the mode parameter and uses ref_candidates as the references for reward calculation.
rewards is a List[List[float]] of shape (len(inputs), len(candidates_texts[0])).
"""

You can easily integrate PairRM into popular RLHF toolkits such as trl; a minimal sketch is shown below.
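
For illustration, here is a minimal sketch (not an official recipe) of a scalar reward function built on blender.rank_with_ref that could be plugged into an RLHF loop such as trl's PPO trainer. It assumes rank_with_ref accepts a single candidate per input, and baseline_responses is a hypothetical argument name for one fixed reference response per prompt:

import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load ranker checkpoint

def pairrm_reward_fn(prompts, responses, baseline_responses):
    # Score each policy response relative to a fixed baseline response for the same prompt.
    # rank_with_ref returns relative scalar rewards (see the example above); with exactly one
    # candidate per input, each inner list holds a single float, which we unwrap.
    rewards = blender.rank_with_ref(
        prompts,
        [[r] for r in responses],           # one candidate per prompt
        return_scores=True,
        batch_size=8,
        ref_candidates=baseline_responses,  # one reference per prompt
    )
    return [r[0] for r in rewards]

The returned list of floats can then be fed to your RLHF trainer of choice as the reward signal.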

Use case 4: DPO (Direct Preference Optimization) with PairRM

PairRM's blender.compare naturally provides the pairwise preference signal needed for DPO (Direct Preference Optimization), which optimizes a model directly from pairwise comparisons. A sketch of building DPO preference pairs is shown below.
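
As a minimal sketch (assuming blender.compare behaves as documented above, and the prompt/chosen/rejected record format commonly expected by DPO trainers such as trl's DPOTrainer), you can label two sampled responses per prompt with PairRM and build a preference dataset:

import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load ranker checkpoint

prompts = ["hello!", "I love you!"]
responses_a = ["hi!", "I hate you!"]
responses_b = ["f**k off!", "I love you, too!"]

# comparison_results[i] is True if responses_a[i] is judged better than responses_b[i]
comparison_results = blender.compare(prompts, responses_a, responses_b)

dpo_records = [
    {"prompt": p, "chosen": a if a_wins else b, "rejected": b if a_wins else a}
    for p, a, b, a_wins in zip(prompts, responses_a, responses_b, comparison_results)
]
# dpo_records can be wrapped into a datasets.Dataset and passed to a DPO trainer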

Load PairRM with Hugging Face from_pretrained()

In this way, you don't need to install llm-blender to use PairRM. More custom development can be achieved based on the model.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from llm_blender.pair_ranker.pairrm import DebertaV2PairRM # or copy the DebertaV2PairRM definition here, https://github.com/yuchenlin/LLM-Blender/blob/main/llm_blender/pair_ranker/pairrm.py
from transformers import AutoTokenizer
from typing import List
pairrm = DebertaV2PairRM.from_pretrained("llm-blender/PairRM-hf", device_map="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained('llm-blender/PairRM-hf')
source_prefix = "<|source|>"
cand1_prefix = "<|candidate1|>"
cand2_prefix = "<|candidate2|>"
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
def tokenize_pair(sources:List[str], candidate1s:List[str], candidate2s:List[str], source_max_length=1224, candidate_max_length=412):
    ids = []
    assert len(sources) == len(candidate1s) == len(candidate2s)
    max_length = source_max_length + 2 * candidate_max_length
    for i in range(len(sources)):
        source_ids = tokenizer.encode(source_prefix + sources[i], max_length=source_max_length, truncation=True)
        candidate_max_length = (max_length - len(source_ids)) // 2
        candidate1_ids = tokenizer.encode(cand1_prefix + candidate1s[i], max_length=candidate_max_length, truncation=True)
        candidate2_ids = tokenizer.encode(cand2_prefix + candidate2s[i], max_length=candidate_max_length, truncation=True)
        ids.append(source_ids + candidate1_ids + candidate2_ids)
    encodings = tokenizer.pad({"input_ids": ids}, return_tensors="pt", padding="max_length", max_length=max_length)
    return encodings

encodings = tokenize_pair(inputs, candidates_A, candidates_B)
encodings = {k:v.to(pairrm.device) for k,v in encodings.items()}
outputs = pairrm(**encodings)
logits = outputs.logits.tolist()
comparison_results = outputs.logits > 0
print(logits)
# [1.9003021717071533, -1.2547134160995483]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), indicating whether candidate A is better than candidate B for each input

Demo

🔥 Check out more usage details in our example Jupyter notebook: blender_usage.ipynb

Data Release

  • To facilitate large-scale evaluation, we introduce a benchmark dataset, MixInstruct, which is a mixture of multiple instruction datasets featuring oracle pairwise comparisons for testing purposes.
  • MixInstruct is the first large-scale dataset consisting of responses from 11 popular open-source LLMs on instruction-following data. The train/val/test splits contain 100k/5k/5k examples, respectively.
  • MixInstruct is collected from 4 well-known instruction datasets: Alpaca-GPT4, Dolly-15k, GPT4All-LAION, and ShareGPT. The ground-truth outputs come from ChatGPT, GPT-4, or human annotations.
  • MixInstruct is evaluated with both auto-metrics (BLEURT, BARTScore, BERTScore, etc.) and ChatGPT. We provide 4,771 examples from the test split that are evaluated by ChatGPT through pairwise comparison.
  • Code to construct the dataset: get_mixinstruct.py
  • HuggingFace 🤗 Dataset link
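
For reference, a minimal sketch of loading and inspecting the dataset with 🤗 datasets; the field names follow the schema quoted in the issues below, and the exact schema on the Hub may differ:

from datasets import load_dataset

mixinstruct_test = load_dataset("llm-blender/mix-instruct", split="test")
example = mixinstruct_test[0]
print(example["instruction"], example["input"], example["output"])
# each example carries candidate responses from the 11 LLMs with per-candidate auto-metric scores
for cand in example["candidates"]:
    print(cand["model"], cand["scores"]["bartscore"])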

Training

Train PairRanker
# installation
pip install -e .[train]

See more details in train_ranker.sh

Please follow the guide in the script to train the ranker.

Here are some explanations for the script parameters:

Changing the torchrun cmd

TORCHRUN_CMD=<your torchrun cmd path>

Normally, it's just torchrun with the proper conda env activated.

Changing the dataset

dataset="<your dataset>"

Changing the ranker backbone

backbone_type="deberta" # "deberta" or "roberta"
backbone_name="microsoft/deberta-v3-large" # "microsoft/deberta-v3-large" or "roberta-large"

Changing the ranker type

ranker="PairRanker" # "PairRanker" or "Summaranker" or "SimCLS"

Filter the candidates used

candidate_model="flan-t5-xxl" # or "alpaca-native"
candidate_decoding_method="top_p_sampling" 
n_candidates=15 # number of candidates to generate
using_metrics="rouge1,rouge2,rougeLsum,bleu" # metrics used as the training signal

Do Training or Inference

do_inference=False # training
do_inference=True # inference

When doing inference, you can change inference_mode to bubble or full to select a different pairwise inference mode.

Limit the datasize used for training, dev and test

max_train_data_size=-1 # -1 means no limit
max_eval_data_size=-1 # -1 means no limit
max_predict_data_size=-1 # -1 means no limit

Do inference on dataset A with a ranker trained on dataset B

dataset=<A>
checkpoint_trained_dataset=<B>
do_inference=True

Resources

Toolkits

  • LLM-Gen: A simple generation script used to get large-scale responses from various large language models.

Model checkpoints

PairRM Community

PairRM has been widely used in various applications, including but not limited to:

We are looking forward to more applications and contributions from the community 🤗!

Star History

Star History Chart

Citation

@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}

llm-blender's People

Contributors

jdf-prog, yuchenlin


llm-blender's Issues

Replicate Experiments

Hi, congratulations on your work! Do you plan to release the code for replicating your experiments? I can only find the PairRanker and GenFuser modules.

How to act as reward model for RLHF

Hi,

I was wondering if you have example code for making PairRM a reward model for RLHF. Do we need a "base" response to compare a generated response from a PPO policy against, or can we directly get a scalar score?

Thanks,
Andrew

How to use gen_fuser alone for merging?

Thank you for your excellent work. I have a question for you. If I want to use gen_fuser to merge two reports and output the final report, how should I do it?

Report 1: You are very cute, with big eyes and a beautiful face.
Report 2: You are very beautiful, with long hair and big eyes.

Thank you for your response.

Issue with calculating >= Vic and OA

I believe there is a mistake in the calculation of the ">= Vic" and ">= OA" metrics, which I found by reimplementation.

In brief, it seems that when Oracle (BERTScore) selects Vicuna as the best model, ">= Vic" is set to false, which violates its definition of "better than or as good as". The same happens for ">= OA". After implementing both metrics correctly:

  • Oracle (BERTScore) achieves 67.68, 81.91
  • Oracle (BLEURT) achieves 68.02, 78.12
  • Oracle (BARTScore) achieves 71.70, 73.53
  • Oracle (BARTScore+BLEURT) achieves 72.40, 78.45, which is better than LLM-Blender

My detailed code is as follows:

import json
import jsonlines

metrics = ['bertscore', 'bartscore', 'bleurt', 'gVic', 'gOA']

def getRankCompare(gpt_cmps:dict, selected_model, base_model):
    if selected_model == base_model: return 1
    cmp_str, st = selected_model+','+base_model, True
    if cmp_str not in gpt_cmps.keys():
        cmp_str, st = base_model+','+selected_model, False
    cmp_result = gpt_cmps.get(cmp_str)
    if 'good' in cmp_result: return 1
    elif 'bad' in cmp_result: return 0
    elif (st and 'A' in cmp_result) or (not st and 'B' in cmp_result): return 1
    else: return 0
        
def getMetrics(data, idx): 
    selected_model = data['candidates'][idx]
    model_scores = selected_model['scores']
    model_name = selected_model['model']
    result = {}
    result['bertscore'], result['bartscore'], result['bleurt'] = model_scores['bertscore'], model_scores['bartscore'], model_scores['bleurt']
    cmp_results = json.loads(data['cmp_results'])
    if cmp_results is None:
        return result
    result['gVic'] = getRankCompare(cmp_results, model_name, 'vicuna-13b-1.1')
    result['gOA'] = getRankCompare(cmp_results, model_name, 'oasst-sft-4-pythia-12b-epoch-3.5')
    return result

def custom_compare(value, current_scores):
    current_value = current_scores['bleurt']
    if value < current_value: return True, current_value
    else: return False, -1

metrics_gater = {metric:[0, 0] for metric in metrics}
with jsonlines.open('./test_data_prepared.jsonl', 'r') as f:
    for data in f:
        candidates = data['candidates']
        value, idx = -1e8, -1
        for i, model_outputs in enumerate(candidates):
            scores = model_outputs['scores']
            result = custom_compare(value, scores)
            if result[0]:
                value = result[1]
                idx = i
        metric_results = getMetrics(data, idx)
        for key, val in metric_results.items():
            metrics_gater[key][0] += val
            metrics_gater[key][1] += 1
for key, val in metrics_gater.items():
    print(f'{key}: {val[0]/val[1]:.4f}')

By the way, changing "if selected_model == base_model: return 1" to "return 0" in getRankCompare reproduces exactly the performance reported in the paper.

Json parse error when downloading dataset from hf

I tried to download dataset but got error like

Generating train split: 75189 examples [08:46, 2716.79 examples/s]
Failed to read file './hf_cache/downloads/c652d27b07a7ba1e26291b441d095edd970a5045e4c1126d0d02e87e805d07e4' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a closing quotation mark in string. in row 6
Traceback (most recent call last):                                 
  File "/remote-home/klv/src/rtv/lib/python3.8/site-packages/datasets/packaged_modules/json/json.py", line 134, in _generate_tables
    dataset = json.load(f)
  File "/remote-home/klv/src/rtv/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/remote-home/klv/src/rtv/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/remote-home/klv/src/rtv/lib/python3.8/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 12658)

During handling of the above exception, another exception occurred:

Traceback (most recent call last): 
  File "/remote-home/klv/src/rtv/lib/python3.8/site-packages/datasets/builder.py", line 1879, in _prepare_split_single
    for _, table in generator:
  File "/remote-home/klv/src/rtv/lib/python3.8/site-packages/datasets/packaged_modules/json/json.py", line 137, in _generate_tables
    raise e
  File "/remote-home/klv/src/rtv/lib/python3.8/site-packages/datasets/packaged_modules/json/json.py", line 113, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Missing a closing quotation mark in string. in row 6

The above exception was the direct cause of the following exception: 


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/remote-home/klv/src/rtv/lib/python3.8/site-packages/datasets/load.py", line 1797, in load_dataset
    builder_instance.download_and_prepare(
  File "/remote-home/klv/src/rtv/lib/python3.8/site-packages/datasets/builder.py", line 909, in download_and_prepare
    self._download_and_prepare(
  File "/remote-home/klv/src/rtv/lib/python3.8/site-packages/datasets/builder.py", line 1004, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/remote-home/klv/src/rtv/lib/python3.8/site-packages/datasets/builder.py", line 1767, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/remote-home/klv/src/rtv/lib/python3.8/site-packages/datasets/builder.py", line 1912, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Split dependencies so that core install is light for model inference

Hello @yuchenlin 👋 !

We're big fans of PairRM at Hugging Face and would like to feature it in some callbacks we're adding to trl. The only drawback is that running

pip install git+https://github.com/yuchenlin/LLM-Blender.git

installs many dependencies besides those needed for model inference (e.g. spacy, openai, etc.). Would it be possible to split your setup.py so that the core deps are just those for inference / running the model, and the remainder are optional?

To give you an idea of what I mean, here's how we handle this in our alignment handbook: https://github.com/huggingface/alignment-handbook/blob/main/setup.py

Thanks!

Training the GenFuser

Hello! Where in the code base do you include the code for training the GenFuser and setting up its training data?

Question about the fuser outputs

I am confused that all my outputs from the `fuser` are nonsense. For example, they all look like '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - '. I am wondering whether this is because my task is a classification task. Maybe the method is more suitable for traditional generation tasks.

Training ranker on Unified-Feedback

Hi,

Thank you for the impressive work!

I have a question about the training of PairRM. From my understanding, PairRM is trained on Unified-Feedback, which is a dataset suite with pairwise comparison data. On the other hand, from your code (and also the mix-instruct dataset), the training data takes the form of {prompt, candidates, scores}, and the number of candidates is often larger than 2.

Can you comment on the difference between these two types of training? Also, in the pairwise setting, do you directly treat 'conv_A_rating' and 'conv_B_rating' (mostly either 0 or 1) in Unified-Feedback as scores? I ask because I find that train_ranker is not directly compatible with Unified-Feedback.

Thanks again!

Data Generation Code

Thanks for the work. Do you plan to release the code showing how you generated the MixInstruct dataset? It would be very helpful!

How to change models to train ranker?

Thank you for the amazing work.

I see that in the paper you take N=11 models and generate candidates with them. Can you point to the code where we can change these N=11 models to, say, N=5 different models (other than those used in this paper) while keeping the GenFuser the same?

Issue with downloading dataset from HuggingFace

Thanks for releasing the dataset! I am trying to download it from HuggingFace using

from datasets import load_dataset
dataset = load_dataset("llm-blender/mix-instruct")

But this gives me the following error:

id: string
instruction: string
to
{'id': Value(dtype='string', id=None), 'instruction': Value(dtype='string', id=None), 'input': Value(dtype='string', id=None), 'output': Value(dtype='string', id=None), 'candidates': [{'decoding_method': Value(dtype='string', id=None), 'model': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'scores': {'logprobs': Value(dtype='float64', id=None), 'rougeL': Value(dtype='float64', id=None), 'rouge2': Value(dtype='float64', id=None), 'rougeLsum': Value(dtype='float64', id=None), 'rouge1': Value(dtype='float64', id=None), 'bleu': Value(dtype='float64', id=None), 'bertscore': Value(dtype='float64', id=None), 'bleurt': Value(dtype='float64', id=None), 'bartscore': Value(dtype='float64', id=None)}}]}
because column names don't match
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "", line 1, in
File "/data/home/swarnadeep/miniconda/envs/multi/lib/python3.10/site-packages/datasets/load.py", line 1797, in load_dataset
builder_instance.download_and_prepare(
File "/data/home/swarnadeep/miniconda/envs/multi/lib/python3.10/site-packages/datasets/builder.py", line 890, in download_and_prepare
self._download_and_prepare(
File "/data/home/swarnadeep/miniconda/envs/multi/lib/python3.10/site-packages/datasets/builder.py", line 985, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/data/home/swarnadeep/miniconda/envs/multi/lib/python3.10/site-packages/datasets/builder.py", line 1746, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/data/home/swarnadeep/miniconda/envs/multi/lib/python3.10/site-packages/datasets/builder.py", line 1891, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

GPU Memory Requirement on Retraining PairRM

I am trying to train PairRM (PairRanker) on the Unified-Feedback dataset (884,528 samples). Following is the modified config:

dataset="unified_feedback"
backbone_type="deberta" 
backbone_name="microsoft/deberta-v3-large"
n_gpu=4
ranker="PairRanker" 

source_maxlength=1224
candidate_maxlength=412
per_device_train_batch_size=4
per_device_eval_batch_size=1
gradient_accumulation_steps=8
using_metrics="human_preference"

The rest of the parameters are unchanged (adafactor disabled, since Adam was mentioned in the ZeRO config). Can you please share your insights on GPU memory consumption during initial training? At present I am training on 4 A100 GPUs and it occupies more than 50 GB on each device (I would assume a 400M-parameter model should not take more than 16 GB per GPU). The training with this setup also seems slow, given the size of the backbone model.

@jdf-prog @yuchenlin Your insights/suggestions on this would be very helpful. Thanks

Load PairRM ranker without Internet access failed

Hi, and thank you for your outstanding work!

I recently attempted to run the PairRM project on a development machine without internet access. Despite pre-downloading the HF model to a local directory, I encountered an issue during load_ranker where the model still attempts to download microsoft/deberta-v3-large. Could you please advise on how to bypass this automatic download, or is it necessary for me to also download that specific model locally?

Thank you for your assistance!

Support for MPS device

Can you please indicate how I can use your wonderful package on a Mac machine with an M2/M3 chip?

Currently I tried:
blender.loadranker("llm-blender/PairRM", device='mps')
which works fine, but
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=1)
gives me an error as it expects CUDA...

Thanks in advance!

RuntimeError: in loading state_dict for CrossCompareReranker

While directly following the code from the readme, I get the following error:

RuntimeError: Error(s) in loading state_dict for CrossCompareReranker:
    Missing key(s) in state_dict: "pretrained_model.embeddings.position_ids"

The barebone code:

import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM", device="CPU") # error thrown at during this

Appreciate any direction on this. I have transformers==4.28.1.

Make `llm-blender` available on PyPI

Hello @jdf-prog now that #22 is resolved, I'd like to make one final request to see whether llm_blender could be made available on PyPI so that one can run:

pip install llm_blender

This would allow us to include llm_blender as an optional dependency in TRL instead of asking the user to run the pip install git+{} command.

Thank you!

Issues with the split of dataset

There may be a domain shift between the current training, validation, and test splits, as measured by the best rate of each model.

I selected bartscore as the measurement and analyzed the best rate of each model in the dataset

  • where best rate = sum(is_best)/len(dataset).

It can be seen that there is a huge difference in the following statistics. For example, OA achieves 14.47% on train and 21.38% on test, but only 0.88% on validation:

[Figures: best rate of each model on the train, validation, and test splits]

Unable to reproduce the results in the paper

May I kindly inquire about the methodology used to obtain the final results presented in Table 2 of your paper? I have attempted to replicate the results without success. Below, I am providing the inference code I used for your reference. I am eagerly looking forward to your response. Thank you for your assistance. :)

batch_size = 16
metrics = ['bleurt']

import llm_blender
import numpy as np
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import copy
from llm_blender.common.evaluation import overall_eval
import datasets
import json

mixinstruct_test = datasets.load_dataset("llm-blender/mix-instruct", split="test", streaming=True)
mixinstruct_test = list(mixinstruct_test)
# remove cmp_results with none cmp results
for ex in mixinstruct_test:
    ex['cmp_results'] = json.loads(ex['cmp_results'])
mixinstruct_test = [x for x in mixinstruct_test if x['cmp_results']]
inputs = [x['input'] for x in mixinstruct_test]
candidates_texts = [[cand['text'] for cand in x['candidates']] for x in mixinstruct_test]
targets = [x['output'] for x in mixinstruct_test]



ranker_config = llm_blender.RankerConfig
ranker_config.ranker_type = "pairranker"
ranker_config.model_type = "deberta"
ranker_config.model_name = "microsoft/deberta-v3-large"
ranker_config.load_checkpoint = "./hf_models/pairranker-deberta-v3-large"
ranker_config.cache_dir = "./hf_models"
ranker_config.source_max_length = 128
ranker_config.candidate_max_length = 128
ranker_config.n_tasks = 1
fuser_config = llm_blender.GenFuserConfig
fuser_config.model_name = "llm-blender/gen_fuser_3b"
fuser_config.cache_dir = "./hf_models"
fuser_config.max_length = 512
fuser_config.candidate_max_length = 128
blender_config = llm_blender.BlenderConfig
blender_config.device = "cuda"
blender = llm_blender.Blender(blender_config, ranker_config, fuser_config)
results = {}
fuse_generations, ranks = blender.rank_and_fuse(inputs, candidates_texts, return_scores=False, batch_size=batch_size, top_k=3)

print("start evaluation...")

scores = overall_eval(fuse_generations, targets, metrics)
print(f"Fusion Scores")
for key, value in scores.items():
    print("  ", key+":", np.mean(value))

print("LLM Scores")
llms = [x['model'] for x in mixinstruct_test[0]['candidates']]
llm_scores_map = {llm: {metric: [] for metric in metrics} for llm in llms}

for ex in mixinstruct_test:
    for cand in ex['candidates']:
        for metric in metrics:
            llm_scores_map[cand['model']][metric].append(cand['scores'][metric])
for i, (llm, scores_map) in enumerate(llm_scores_map.items()):
    print(f"{i} {llm}")
    for metric, llm_scores in llm_scores_map[llm].items():
        print("  ", metric+":", "{:.4f}".format(np.mean(llm_scores)))
