

lorahub's Issues

Evaluation using Glue Lorahub Modules

Hi! I am trying to use the lorahub modules trained on the GLUE CoLA dataset, but they do not seem to produce the correct output. Running the following code gives the output "The next one is more accurate." rather than the expected label "1".

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model = PeftModel.from_pretrained(model, "lorahub/flan_t5_large-glue_cola")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

input_text = "Our friends won't buy this analysis, let alone the next one we propose."
input_ids = tokenizer(input_text, truncation=True, padding=True, return_tensors="pt").input_ids
outputs = model.generate(inputs=input_ids)
print(tokenizer.decode(outputs[0]))

How to evaluate glue lora?

Could you please provide an example of how to load a model, instantiate it, and apply a LoRA (Low-Rank Adaptation) using Hugging Face? Specifically, I'm interested in seeing how to download a LoRA and integrate it with a model instance, similar to the examples provided by Hugging Face.

I would like to reproduce the GLUE datasets' performance by just running inference, to see it with my own eyes.

Thank you for the great work!
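
A minimal sketch of such a loading-and-evaluation flow, assuming the GLUE CoLA validation split from the datasets library and a plain "0"/"1" label mapping; the exact prompt template the module expects is an assumption, not a confirmed detail:

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

# Load the base model and apply the CoLA LoRA module from the hub.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model = PeftModel.from_pretrained(model, "lorahub/flan_t5_large-glue_cola")

# Run inference on a small sample of the GLUE CoLA validation set.
cola = load_dataset("glue", "cola", split="validation")
correct, total = 0, 100
for example in cola.select(range(total)):
    input_ids = tokenizer(example["sentence"], return_tensors="pt").input_ids
    outputs = model.generate(input_ids, max_new_tokens=5)
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    # Assumed label mapping: the module emits "0" (unacceptable) or "1" (acceptable).
    if prediction == str(example["label"]):
        correct += 1
print("accuracy on sample:", correct / total)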

No single LoRA model can score on Disambiguation

Description

I tested almost all the LoRA models on the Hugging Face lorahub, but not a single LoRA can score on the Disambiguation subset. Is it really possible that merging weights makes the miracle happen? And without the Disambiguation subset, all the LoRA models are pretty similar to the flan-t5-large result on BBH.
Some LoRAs cannot be loaded because the filename differs from the Hugging Face model ID. I will fix that problem later, but the current result is strange.

Code

import threading
import os

from lorahub.constant import LORA_MODULE_NAMES


def run(device: int):
    # Each thread evaluates every 8th LoRA module on its own GPU.
    for i, lora in enumerate(LORA_MODULE_NAMES):
        if i % 8 != device:
            continue
        loraname = lora.split('/')[-1]
        print(f'thread:{device} is testing {loraname}')
        os.system(f'CUDA_VISIBLE_DEVICES={device} python reproduce_bbh.py >log/{loraname}.log 2>log/{loraname}.err '
                  f'--lora_model /data/sdb/sunyuhan/exp/lorahub_models/{loraname}')
    print(device)


threads = [threading.Thread(target=run, args=[i]) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("All threads have finished.")

TOP5 LoRA vs flan-T5-large

As the radar chart below shows, all the models perform similarly, and none scores on the Disambiguation subset:
(Radar chart: top-5 LoRA modules vs. flan-t5-large across the BBH subsets.)

How to use local LoRA trained with train_model.py

Hi team!

I trained my own LoRA with train_model.py, so now I have a folder called 'demo' that contains my LoRA model (located at /lorahub/demo). If I want to use it in the lorahub_learning process, should I add 'lorahub/demo' to LORA_MODULE_NAMES? I have tried that, but it failed somehow.
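
A minimal sketch of one way to try this, assuming lorahub_learning resolves a local adapter directory the same way it resolves a Hugging Face module ID (which may be exactly what is failing here); the local path and few-shot examples below are placeholders:

import random

from lorahub.algorithm import lorahub_learning
from lorahub.constant import LORA_MODULE_NAMES

# Placeholder few-shot examples for the target task.
example_inputs = ["Today is Christmas Eve of 1937. What is the date tomorrow in MM/DD/YYYY?\nAnswer:"]
example_outputs = ["12/25/1937"]

# Mix hub modules with the locally trained adapter directory.
random.seed(42)
modules = random.sample(LORA_MODULE_NAMES, 19)
modules.append("/absolute/path/to/lorahub/demo")  # hypothetical local path from train_model.py

module_weights, model, tokenizer = lorahub_learning(
    lora_module_list=modules,
    example_inputs=example_inputs,
    example_outputs=example_outputs,
    max_inference_step=40,
    batch_size=5,
)
print("learned module weights:", module_weights)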

Hope to get your guidance.

Thanks!
Fanjunduo Wei

Request for Guidance on Reproducing Experiments for BigBenchHard

Description

I am currently attempting to reproduce the experiment results outlined in your paper using the Flan-t5-large model on the BBH task. I've encountered a deviation in the results and I'm seeking assistance in addressing this issue.

Brief summary of my situation

Experiment Details

I utilized the script available at this GitHub repository to conduct the evaluation.

Discrepancy

My obtained score for Flan-t5-large on the BBH task was 31.82, whereas your paper reports a score of 27 for the same model-task combination.
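
One possible (unconfirmed) source of such a gap is the scoring rule rather than the model itself: a substring-containment accuracy, like the one used in another script in this thread, is more lenient than exact match. A small illustration:

def exact_match(pred: str, truth: str) -> bool:
    # Strict comparison after light normalization.
    return pred.strip().lower().replace(".", "") == truth.strip().lower().replace(".", "")

def containment_match(pred: str, truth: str) -> bool:
    # Lenient comparison: the label only has to appear inside the prediction.
    return truth.strip().replace(".", "") in pred.strip().replace(".", "")

pred, truth = "(A) The patient had a skin condition", "(A)"
print(exact_match(pred, truth))        # False
print(containment_match(pred, truth))  # True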

CoT Disabled

To ensure consistency, I made sure to set ntrain=0 as instructed.

Request

I kindly ask for your guidance in reproducing the experiment results as presented in your paper. It would be immensely helpful if you could provide some insights, suggestions, or clarifications to assist me in achieving consistent results.

I greatly appreciate your time and consideration in helping me address this matter. Your expertise will go a long way in resolving this discrepancy and advancing my understanding of the experiment.

Thank you in advance for your assistance!

Log of Evaluation

[2023-08-15 14:55:45,077] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
{'data_dir': 'lukaemon/bbh', 'ntrain': 0, 'kwargs': {'model_name': 'seq_to_seq', 'model_path': '/data/sdb/sunyuhan/exp/flan-t5-large'}, 'args': Namespace(data_dir='lukaemon/bbh', ntrain=0, kwargs={'model_name': 'seq_to_seq', 'model_path': '/data/sdb/sunyuhan/exp/flan-t5-large'}), 'model': SeqToSeqModel(model_path='/data/sdb/sunyuhan/exp/flan-t5-large', max_input_length=2048, max_output_length=32, model=None, tokenizer=None, lora_path='', device='cuda', load_8bit=False, do_sample=False)}
{'prompt': 'not ( True ) and ( True ) is\nAnswer:', 'label': 'False', 'pred': 'not'}
{'name': 'boolean_expressions', 'score': 0.504}
{'prompt': 'How would a typical person answer each of the following questions about causation?\nA machine is set up in such a way that it will short circuit if both the black wire and the red wire touch the battery at the same time. The machine will not short circuit if just one of these wires touches the battery. The black wire is designated as the one that is supposed to touch the battery, while the red wire is supposed to remain in some other part of the machine. One day, the black wire and the red wire both end up touching the battery at the same time. There is a short circuit. Did the black wire cause the short circuit?\nOptions:\n- Yes\n- No\nAnswer:', 'label': 'No', 'pred': 'Yes'}
{'name': 'causal_judgement', 'score': 0.5989304812834224}
{'prompt': 'Today is Christmas Eve of 1937. What is the date tomorrow in MM/DD/YYYY?\nOptions:\n(A) 12/11/1937\n(B) 12/25/1937\n(C) 01/04/1938\n(D) 12/04/1937\n(E) 12/25/2006\n(F) 07/25/1937\nAnswer:', 'label': '(B)', 'pred': '(A)'}
{'name': 'date_understanding', 'score': 0.22}
{'prompt': 'In the following sentences, explain the antecedent of the pronoun (which thing the pronoun refers to), or state that it is ambiguous.\nSentence: The patient was referred to the specialist because he had a rare skin condition.\nOptions:\n(A) The patient had a skin condition\n(B) The specialist had a skin condition\n(C) Ambiguous\nAnswer:', 'label': '(A)', 'pred': 'The patient had a skin condition'}
{'name': 'disambiguation_qa', 'score': 0.0}
{'prompt': 'Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: [ [\nAnswer:', 'label': '] ]', 'pred': ']'}
{'name': 'dyck_languages', 'score': 0.0}
{'prompt': '"Here comes a perfectly valid argument: First of all, whoever is a schoolmate of Sondra is not a stepsister of Pricilla. In consequence, whoever is not a stepsister of Pricilla is a schoolmate of Sondra."\nIs the argument, given the explicitly stated premises, deductively valid or invalid?\nOptions:\n- valid \n- invalid\nAnswer:', 'label': 'invalid', 'pred': 'valid'}
{'name': 'formal_fallacies', 'score': 0.572}
{'prompt': 'This SVG path element <path d="M 55.57,80.69 L 57.38,65.80 M 57.38,65.80 L 48.90,57.46 M 48.90,57.46 L 45.58,47.78 M 45.58,47.78 L 53.25,36.07 L 66.29,48.90 L 78.69,61.09 L 55.57,80.69"/> draws a\nOptions:\n(A) circle\n(B) heptagon\n(C) hexagon\n(D) kite\n(E) line\n(F) octagon\n(G) pentagon\n(H) rectangle\n(I) sector\n(J) triangle\nAnswer:', 'label': '(B)', 'pred': '(A)'}
{'name': 'geometric_shapes', 'score': 0.172}
{'prompt': 'Which sentence has the correct adjective order:\nOptions:\n(A) midsize old grey Brazilian sweater\n(B) midsize grey Brazilian old sweater\nAnswer:', 'label': '(A)', 'pred': '(A) old, grey, sweater'}
{'name': 'hyperbaton', 'score': 0.596}
{'prompt': 'The following paragraphs each describe a set of five objects arranged in a fixed order. The statements are logically consistent within each paragraph. On a branch, there are five birds: a quail, an owl, a raven, a falcon, and a robin. The owl is the leftmost. The robin is to the left of the raven. The quail is the rightmost. The raven is the third from the left.\nOptions:\n(A) The quail is the rightmost\n(B) The owl is the rightmost\n(C) The raven is the rightmost\n(D) The falcon is the rightmost\n(E) The robin is the rightmost\nAnswer:', 'label': '(A)', 'pred': '(A)'}
{'name': 'logical_deduction_five_objects', 'score': 0.452}
{'prompt': 'The following paragraphs each describe a set of seven objects arranged in a fixed order. The statements are logically consistent within each paragraph. In a golf tournament, there were seven golfers: Ana, Eve, Ada, Dan, Rob, Amy, and Joe. Dan finished third. Ana finished above Ada. Amy finished last. Dan finished below Rob. Eve finished below Ada. Rob finished below Joe.\nOptions:\n(A) Ana finished third\n(B) Eve finished third\n(C) Ada finished third\n(D) Dan finished third\n(E) Rob finished third\n(F) Amy finished third\n(G) Joe finished third\nAnswer:', 'label': '(D)', 'pred': '(D)'}
{'name': 'logical_deduction_seven_objects', 'score': 0.456}
{'prompt': 'The following paragraphs each describe a set of three objects arranged in a fixed order. The statements are logically consistent within each paragraph. On a branch, there are three birds: a blue jay, a quail, and a falcon. The falcon is to the right of the blue jay. The blue jay is to the right of the quail.\nOptions:\n(A) The blue jay is the second from the left\n(B) The quail is the second from the left\n(C) The falcon is the second from the left\nAnswer:', 'label': '(A)', 'pred': '(A)'}
{'name': 'logical_deduction_three_objects', 'score': 0.516}
{'prompt': 'Find a movie similar to Batman, The Mask, The Fugitive, Pretty Woman:\nOptions:\n(A) The Front Page\n(B) Maelstrom\n(C) The Lion King\n(D) Lamerica\nAnswer:', 'label': '(C)', 'pred': '(A)'}
{'name': 'movie_recommendation', 'score': 0.56}
{'prompt': '((-1 + 2 + 9 * 5) - (-2 + -4 + -4 * -7)) =\nAnswer:', 'label': '24', 'pred': '((-1 + 3) + 9 * 5) - (-4 + -6) + -23 * -7))'}
{'name': 'multistep_arithmetic_two', 'score': 0.0}
{'prompt': 'If you follow these instructions, do you return to the starting point? Always face forward. Take 1 step backward. Take 9 steps left. Take 2 steps backward. Take 6 steps forward. Take 4 steps forward. Take 4 steps backward. Take 3 steps right.\nOptions:\n- Yes\n- No\nAnswer:', 'label': 'No', 'pred': 'No'}
{'name': 'navigate', 'score': 0.576}
{'prompt': 'I have a flute, a piano, a trombone, four stoves, a violin, an accordion, a clarinet, a drum, two lamps, and a trumpet. How many musical instruments do I have?\nAnswer:', 'label': '8', 'pred': '20'}
{'name': 'object_counting', 'score': 0.288}
{'prompt': 'Here is a table where the first line is a header and each subsequent line is a penguin:  name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15  For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm.  We now add a penguin to the table:\nJames, 12, 90, 12\nAnd here is a similar table, but listing giraffes:\nname, age, height (cm), weight (kg)\nJody, 5, 430, 620\nGladys, 10, 420, 590\nMarian, 2, 310, 410\nDonna, 9, 440, 650\nWhich is the oldest penguin?\nOptions:\n(A) Louis\n(B) Bernard\n(C) Vincent\n(D) Gwen\n(E) James\nAnswer:', 'label': '(E)', 'pred': '(E)'}
{'name': 'penguins_in_a_table', 'score': 0.3424657534246575}
{'prompt': 'On the floor, there is one mauve cat toy, two purple cat toys, three grey cat toys, two mauve notebooks, three grey notebooks, three burgundy cat toys, and one purple notebook. If I remove all the notebooks from the floor, how many grey objects remain on it?\nOptions:\n(A) zero\n(B) one\n(C) two\n(D) three\n(E) four\n(F) five\n(G) six\n(H) seven\n(I) eight\n(J) nine\n(K) ten\n(L) eleven\n(M) twelve\n(N) thirteen\n(O) fourteen\n(P) fifteen\n(Q) sixteen\nAnswer:', 'label': '(D)', 'pred': 'There are 1 mauve cat toy, 2 purple cat toys, 3 grey cat toys, 2 grey notebooks, 3 burgundy cat'}
{'name': 'reasoning_about_colored_objects', 'score': 0.292}
{'prompt': "Which of the following is a humorous edit of this artist or movie name: 'rain man'?\nOptions:\n(A) ruin man\n(B) rains man\n(C) rain men\n(D) rainmman\nAnswer:", 'label': '(A)', 'pred': '(D)'}
{'name': 'ruin_names', 'score': 0.236}
{'prompt': 'The following translations from German to English contain a particular error. That error will be one of the following types: Named Entities: An entity (names, places, locations, etc.) is changed to a different entity. Numerical Values: Numerical values (ordinals or cardinals), dates, and/or units are changed. Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are changed. Negation or Antonyms: Introduce or remove a negation or change comparatives to their antonyms. Facts: Trivial factual errors not pertaining to the above classes are introduced in the translations. Dropped Content: A significant clause in the translation is removed. Please identify that error.  Source: Karl Borromäus Joseph Fürst von Liechtenstein war ein kaiserlicher Feldmarschall.\nTranslation: Charles Borromeo Joseph Prince of Liechtenstein was an judicial field marshal.\nThe translation contains an error pertaining to\nOptions:\n(A) Modifiers or Adjectives\n(B) Numerical Values\n(C) Negation or Antonyms\n(D) Named Entities\n(E) Dropped Content\n(F) Facts\nAnswer:', 'label': '(A)', 'pred': '(C)'}
{'name': 'salient_translation_error_detection', 'score': 0.352}
{'prompt': "Which statement is sarcastic?\nOptions:\n(A) He's a generous person, trying to promote a charity stream that has raised millions to help kids in need\n(B) He's a terrible person, trying to promote a charity stream that has raised millions to help kids in need\nAnswer:", 'label': '(B)', 'pred': '(B)'}
{'name': 'snarks', 'score': 0.550561797752809}
{'prompt': 'Is the following sentence plausible? "Elias Lindholm beat the buzzer."\nAnswer:', 'label': 'no', 'pred': 'yes'}
{'name': 'sports_understanding', 'score': 0.552}
{'prompt': 'Today, Susan went to the coffee shop. Between what times could they have gone?\nWe know that:\nSusan woke up at 7am.\nLinda saw Susan driving to the water park from 7am to 11am.\nJohn saw Susan buying clothes at the mall from 11am to 12pm.\nJessica saw Susan taking photos near the Eiffel Tower from 12pm to 1pm.\nSteven saw Susan buying lunch at the deli from 1pm to 2pm.\nThomas saw Susan reading at the library from 2pm to 6pm.\nThe coffee shop was closed after 9pm.\nBetween what times could Susan have gone to the coffee shop?\nOptions:\n(A) 6pm to 9pm\n(B) 7am to 11am\n(C) 1pm to 2pm\n(D) 2pm to 6pm\nAnswer:', 'label': '(A)', 'pred': '(A)'}
{'name': 'temporal_sequences', 'score': 0.22}
{'prompt': 'Alice, Bob, Claire, Dave, and Eve are dancers at a square dance. At the start of a song, they each have a partner: Alice is dancing with Patrick, Bob is dancing with Sam, Claire is dancing with Jamie, Dave is dancing with Lola, and Eve is dancing with Melissa.\nThroughout the song, the dancers often trade partners. First, Dave and Eve switch partners. Then, Dave and Alice switch partners. Then, Eve and Alice switch partners. Then, Claire and Bob switch partners. Finally, Dave and Alice switch partners. At the end of the dance, Alice is dancing with\nOptions:\n(A) Patrick\n(B) Sam\n(C) Jamie\n(D) Lola\n(E) Melissa\nAnswer:', 'label': '(A)', 'pred': '(A)'}
{'name': 'tracking_shuffled_objects_five_objects', 'score': 0.12}
{'prompt': 'Alice, Bob, Claire, Dave, Eve, Fred, and Gertrude are on the same team in a soccer match. At the start of the match, they are each assigned to a position: Alice is playing striker, Bob is playing right winger, Claire is playing left winger, Dave is playing benchwarmer, Eve is playing goalkeeper, Fred is playing center midfielder, and Gertrude is playing cheerleader.\nAs the game progresses, pairs of players occasionally swap positions. First, Eve and Claire trade positions. Then, Gertrude and Alice trade positions. Then, Fred and Bob trade positions. Then, Dave and Fred trade positions. Then, Fred and Bob trade positions. Then, Bob and Eve trade positions. Finally, Claire and Alice trade positions. At the end of the match, Gertrude is playing\nOptions:\n(A) striker\n(B) right winger\n(C) left winger\n(D) benchwarmer\n(E) goalkeeper\n(F) center midfielder\n(G) cheerleader\nAnswer:', 'label': '(A)', 'pred': '(G)'}
{'name': 'tracking_shuffled_objects_seven_objects', 'score': 0.084}
{'prompt': 'Alice, Bob, and Claire are friends and avid readers who occasionally trade books. At the start of the semester, they each buy one new book: Alice gets Ulysses, Bob gets Frankenstein, and Claire gets Lolita.\nAs the semester proceeds, they start trading around the new books. First, Claire and Bob swap books. Then, Bob and Alice swap books. Finally, Claire and Bob swap books. At the end of the semester, Bob has\nOptions:\n(A) Ulysses\n(B) Frankenstein\n(C) Lolita\nAnswer:', 'label': '(B)', 'pred': '(B)'}
{'name': 'tracking_shuffled_objects_three_objects', 'score': 0.332}
{'prompt': 'Question: Sherrie tells the truth. Vernell says Sherrie tells the truth. Alexis says Vernell lies. Michaela says Alexis tells the truth. Elanor says Michaela tells the truth. Does Elanor tell the truth?\nAnswer:', 'label': 'No', 'pred': 'yes'}
{'name': 'web_of_lies', 'score': 0.0}
{'prompt': 'Sort the following words alphabetically: List: syndrome therefrom\nAnswer:', 'label': 'syndrome therefrom', 'pred': 'therefrom'}
{'name': 'word_sorting', 'score': 0.0}
{'average': 0.31822066786892184}
{'bbh': 31.82}
bbh: 31.82

FLAN datasets not available on Huggingface.

Hi @SivilTaram, thank you for sharing the training scripts!

I was trying to download the FLAN datasets using the script. It seems the dataset tag "conceptofmind/FLAN_2022" is no longer available on Hugging Face. Would it be possible for you to upload the data you have to Hugging Face or share a copy privately with me? This is critical for reproducing some of the experiments on my end.

Thanks,
Prateek

train_model model load repaired

In the code of train_model.py:

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_args.model_name_or_path,
    from_tf=bool(".ckpt" in model_args.model_name_or_path),
    config=config,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)

If anyone cannot use this to initialize a flan-t5 model from AutoModelForSeq2SeqLM, you need to add the following parameters:

unk_token="<unk>",
bos_token="<s>",
eos_token="</s>"

Thanks!

About LoraHub adapters

Nice work!

I needed clarification about the adapters you released on HF: the base model is flan-t5, and the adapters are trained on some Flan tasks. Am I missing something, or are they trained on seen data? E.g., the ANLI task is a Flan 2021 task (which flan-t5 has probably seen) and is included in your hub.

Also, are you planning to release the code to reproduce the numbers in your paper in the near future?

Thanks.

About training examples

Hi, if the data is in Chinese, how many examples are needed to train such a LoRA?

Also, will the LoRA heavily affect the original model's ability?

AttributeError: 'GenerationConfig' object has no attribute 'cache_implementation'

@mavenlin @SivilTaram @P2333 @chenxwh @Boyu-Mi I want to evaluate my pruned llama-7b model (saved with torch.save), but the following error happened:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/iotsc01/xinpengq/LLM-Pruner-main/lorahub-main/reproduce_bbh.py:181 in │
│ │
│ 178 │ │ # unzip │
│ 179 │ │ os.system("unzip data_bbh.zip") │
│ 180 │ # evaluate the model │
│ ❱ 181 │ evaluate_flan_results_zero_shot("data_bbh", args.ckpt) │
│ 182 │ # # five shot for flan models │
│ 183 │ # evaluate_flan_results_few_shot("data_bbh", "google/flan-t5-large") │
│ 184 │ # # five shot for lorahub models │
│ │
│ /home/iotsc01/xinpengq/LLM-Pruner-main/lorahub-main/reproduce_bbh.py:63 in │
│ evaluate_flan_results_zero_shot │
│ │
│ 60 │ │ │ │ return_tensors="pt", │
│ 61 │ │ │ │ padding=True, │
│ 62 │ │ │ ).to(model.device) │
│ ❱ 63 │ │ │ outputs = model.generate( │
│ 64 │ │ │ │ input_ids=inputs["input_ids"], max_new_tokens=5 │
│ 65 │ │ │ ) │
│ 66 │ │ │ outputs = tokenizer.batch_decode( │
│ │
│ /home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/torch/utils/_contextlib.p │
│ y:115 in decorate_context │
│ │
│ 112 │ @functools.wraps(func) │
│ 113 │ def decorate_context(*args, **kwargs): │
│ 114 │ │ with ctx_factory(): │
│ ❱ 115 │ │ │ return func(*args, **kwargs) │
│ 116 │ │
│ 117 │ return decorate_context │
│ 118 │
│ │
│ /home/iotsc01/anaconda3/envs/xinpengq_env/lib/python3.10/site-packages/transformers/generation/u │
│ tils.py:1456 in generate │
│ │
│ 1453 │ │ │
│ 1454 │ │ # if we don't pass past_key_values and a cache_implementation is specified │
│ 1455 │ │ │
│ ❱ 1456 │ │ if generation_config.cache_implementation in NEED_SETUP_CACHE_CLASSES_MAPPING an │
│ 1457 │ │ │ "past_key_values", False │
│ 1458 │ │ ): │
│ 1459 │ │ │ cache_cls = NEED_SETUP_CACHE_CLASSES_MAPPING[generation_config.cache_impleme │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'GenerationConfig' object has no attribute 'cache_implementation'

How can I solve it? The following is my code:

import argparse
import sys
sys.path.append("../LLaMA-Factory-main/src")

import os
import json
import random
from random import shuffle

import torch
from LLMPruner.peft import PeftModel
from lorahub.algorithm import load_dataset, lorahub_learning, lorahub_inference
from lorahub.constant import LORA_MODULE_NAMES
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer, AutoModel, AutoTokenizer


def evaluate_flan_results_zero_shot(folder, model_path, **kwargs):
    sub_dirs = os.listdir(folder)
    pruned_dict = torch.load(model_path, map_location='cpu')
    tokenizer, model = pruned_dict['tokenizer'], pruned_dict['model']
    model.config.pad_token_id = tokenizer.pad_token_id = 0
    model.config.bos_token_id = 1
    model.config.eos_token_id = 2
    res = [0, 0]
    for sub_dir in sub_dirs:
        test_file_path = os.path.join(folder, sub_dir, "zero_shot.jsonl")
        task_inputs, task_outputs = [], []
        for line in open(test_file_path, "r", encoding="utf-8"):
            example = json.loads(line)
            task_inputs.append(example["context"])
            task_outputs.append(example["completion"])
        print("Evaluating on task (zero shot): ", sub_dir)
        # _,task_perf = lorahub_inference(task_inputs,
        #                   model,
        #                   tokenizer,
        #                   16,
        #                   task_outputs)
        
        def accuracy_score(outputs, ground_truths):
            correct = 0
            total = 0
            for output, truth in zip(outputs, ground_truths):
                # if output.strip().lower().replace(".", "") == truth.strip().lower().replace(".", ""):
                if truth.strip().replace(".", "") in output.strip().replace(".", ""):
                    correct += 1
                total += 1
            return correct / total * 100

        example_predictions = []
           
        # process dataset
        dataset = load_dataset(task_inputs, task_outputs, tokenizer)
        # use gpu if available
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = model.to(device)
        for i in range(0, len(dataset["input"]), 4):
            inputs = tokenizer(
                dataset["input"][i : i + 4],
                max_length=2048,
                return_tensors="pt",
                padding=True,
            ).to(model.device)
            outputs = model.generate(
                input_ids=inputs["input_ids"], max_new_tokens=5
            )
            outputs = tokenizer.batch_decode(
                outputs.to("cpu"), skip_special_tokens=True
            )
            example_predictions.extend(outputs)
        for i in range(len(example_predictions)):
            index = example_predictions[i].find("A:")
            if index != -1:
                example_predictions[i] = example_predictions[i][index+2:]
            
        if task_outputs is not None:
            task_perf = accuracy_score(example_predictions, task_outputs)
        else:
            task_perf = None
        print("***",task_perf,"***")
        print(example_predictions)
        res[0] += 1
        res[1] += task_perf
        torch.cuda.empty_cache()
    print("result avg:",res[1]/res[0])

def evaluate_flan_results_few_shot(folder, flan_model_name):
    sub_dirs = os.listdir(folder)

    for sub_dir in sub_dirs:
        test_file_path = os.path.join(folder, sub_dir, "few_shot.jsonl")
        task_inputs, task_outputs = [], []
        for line in open(test_file_path, "r", encoding="utf-8"):
            example = json.loads(line)
            task_inputs.append(example["context"])
            task_outputs.append(example["completion"])
        print("Evaluating on task (five shot): ", sub_dir)
        lorahub_inference(task_inputs,
                          flan_model_name,
                          flan_model_name,
                          16,
                          task_outputs)


def evaluate_lorahub_results_few_shot(folder, flan_model_name):
    sub_dirs = os.listdir(folder)

    # 5 seeds used in our experiments
    for sub_dir in sub_dirs:
        # construct the few-shot examples for lorahub learning
        example_inputs, examples_outputs = [], []
        example_file_path = os.path.join(folder, sub_dir, "example.jsonl")
        for line in open(example_file_path, "r", encoding="utf-8"):
            example = json.loads(line)
            example_inputs.append(example["context"])
            examples_outputs.append(example["completion"])
            
        # random select 5 examples for each task
        random.seed(42)
        shuffled_set = list(zip(example_inputs, examples_outputs))
        random.shuffle(shuffled_set)
        example_inputs, examples_outputs = zip(*shuffled_set)
        # take the first 5 examples
        example_inputs, examples_outputs = example_inputs[:5], examples_outputs[:5]

        # load the zero-shot examples for evaluation
        test_file_path = os.path.join(folder, sub_dir, "zero_shot.jsonl")
        task_inputs, task_outputs = [], []
        for line in open(test_file_path, "r", encoding="utf-8"):
            example = json.loads(line)
            task_inputs.append(example["context"])
            task_outputs.append(example["completion"])

        task_perf_list = []
        for seed in range(1, 6):
            random.seed(seed)

            def get_lora_module_list():
                return random.sample(LORA_MODULE_NAMES, 20)
            # get a list of modules to be used in the composition
            modules = get_lora_module_list()

            # perform LoRAHub learning
            module_weights, model, tokenizer = lorahub_learning(lora_module_list=modules,
                                                                example_inputs=example_inputs,
                                                                example_outputs=examples_outputs,
                                                                max_inference_step=40,
                                                                batch_size=5)

            print("module_weights:", module_weights)

            """
            Perform inference to get predictions
            """
            _, task_acc = lorahub_inference(example_inputs=task_inputs,
                                            model_or_name_path=model,
                                            tokenizer_or_tokenizer_path=tokenizer,
                                            batch_size=10,
                                            # can set as None if you do not have the ground truth
                                            example_outputs=task_outputs)
            task_perf_list.append(task_acc)
        avg_perf, max_perf = sum(task_perf_list) / len(task_perf_list), max(task_perf_list)
        print("average perf:", avg_perf, "best perf:", max_perf)


if __name__ == "__main__":
    
    parser = argparse.ArgumentParser(description='Tuning Pruned LLaMA (huggingface version)')
    parser.add_argument('--base_model', type=str, default="/home/iotsc01/LLM-Pruner-main/llama-7b-hf", help='base model name')
    parser.add_argument('--ckpt', type=str, default=None, help='pruned model path')
    parser.add_argument('--lora_ckpt', type=str, default=None)
    parser.add_argument('--max_seq_len', type=int, default=128, help='max sequence length')
    parser.add_argument('--test_mod', type=str, default="pruned", help='choose from [pruned, tuned, base]')
    args = parser.parse_args()
    
    if not os.path.exists("data_bbh"):
        # download dataset
        os.system("wget https://github.com/sail-sg/lorahub/releases/download/0.1/data_bbh.zip")
        # unzip
        os.system("unzip data_bbh.zip")
    # evaluate the model
    evaluate_flan_results_zero_shot("data_bbh", args.ckpt)
    # # five shot for flan models
    # evaluate_flan_results_few_shot("data_bbh", "google/flan-t5-large")
    # # five shot for lorahub models
    # evaluate_lorahub_results_few_shot("data_bbh", "google/flan-t5-large")

Is there any scalability issue?

Very interesting work! I have two questions and hope to get answers:

  1. Does the performance improve as the number of LoRA models increases?
  2. Have you conducted theoretical tests or experiments to validate the concept of LoraHub?

Discussing LoraHub: Exploration, Implementation, and Potential Improvements

LoraHub is a really great idea, similar to a few ideas I thought of yesterday.

  1. Unlike MoE, which trains many domain experts, it trains multiple LoRAs on top of one large base model.
  2. During inference, a router mechanism is used to select which LoRA weights to combine. Only one base model is needed for deployment. Like a chain of trees, if you run inference several times, you can achieve better performance.
  3. The training parameters and data for each LoRA can be made more aggressive, ready to scale up. For example, take a 65B base model, use high-quality data from 8 different domains, and separately train eight 1B LoRAs. Has anyone compared whether this performs better or worse than MoE?
  4. It is not yet very clear which base models were chosen in the paper, what the training parameters were, how the LoRAs were merged for inference, and many other details. I am waiting for the code to be published for more details.
  5. How to cleverly design the router mechanism is also worth researching and discussing. Are there any related materials to recommend?

Downloading the FLAN-v2 dataset

I am having a hard time downloading the flanv2 dataset provided in this repo.

The git clone command only downloads the metadata for each dataset.
Using the Hugging Face load_dataset function downloads a single unified dataset that comes from the P3 dataset (source).
The Hugging Face CLI method seems to have authentication issues.

I could manually go through the Hugging Face flanv2 dataset repo and download each dataset's LFS files, but that is time-consuming.

Is there any graceful way to download the datasets used?
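
One graceful option, assuming the data lives in an ordinary Hugging Face dataset repository, is huggingface_hub.snapshot_download with allow_patterns so that only the needed files are fetched; the repo ID and file patterns below are hypothetical placeholders:

from huggingface_hub import snapshot_download

# Download only the files matching the given patterns from a dataset repo.
local_dir = snapshot_download(
    repo_id="some-org/flan-v2",           # hypothetical dataset repo ID
    repo_type="dataset",
    allow_patterns=["cot_*", "*.jsonl"],  # hypothetical file patterns
)
print("downloaded to:", local_dir)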

repaired an error in reproduce_bbh.py (shuffle seed)

Hi there, I found I could not use shuffle with a seed; if anyone meets this problem, you can use the following code to replace the original, which is around line 62:
random.seed(42)
shuffled_set = list(zip(example_inputs, examples_outputs))
random.shuffle(shuffled_set)
example_inputs, examples_outputs = zip(*shuffled_set)

reproduce_bbh.py

Bug Report - (batched option & tensor shape)

Description

I encountered some bugs while running python example.py.

1. batched problem

If batched=True is set by default at algorithm.py:105 in your code, the exception below is raised:

Traceback (most recent call last):
  File "/data/sdb/sunyuhan/exp/lorahub/example.py", line 239, in <module>
    main()
  File "/data/sdb/sunyuhan/exp/lorahub/example.py", line 211, in main
    module_weights, model, tokenizer = lorahub_learning(lora_module_list=modules,
  File "/data/sdb/sunyuhan/exp/lorahub/lorahub/algorithm.py", line 264, in lorahub_learning
    dataset = load_dataset(example_inputs, example_outputs, tokenizer) 
  File "/data/sdb/sunyuhan/exp/lorahub/lorahub/algorithm.py", line 103, in load_dataset
    processed_datasets = dataset.map(
  File "/root/miniconda3/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2092, in map
    return self._map_single(
  File "/root/miniconda3/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 485, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/datasets/fingerprint.py", line 411, in wrapper
    out = func(self, *args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2469, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/root/miniconda3/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2361, in apply_function_on_filtered_inputs
    validate_function_output(processed_inputs, indices)
  File "/root/miniconda3/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2340, in validate_function_output
    raise TypeError(
TypeError: Provided `function` which is applied to all elements of table returns a `dict` of types [<class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>]. When using `batched=True`, make sure provided `function` returns a `dict` of types like `(<class 'list'>, <class 'numpy.ndarray'>)`.

Solution: set batched=False:

processed_datasets = dataset.map(
    preprocess_func_with_tokenizer,
  ->batched=False,
    num_proc=1,
    desc="Running tokenizer on dataset",
)

2. padding problem

All the padding values are currently True, and the exception below is raised:

Traceback (most recent call last):
  File "/data/sdb/sunyuhan/exp/lorahub/example.py", line 239, in <module>
    main()
  File "/data/sdb/sunyuhan/exp/lorahub/example.py", line 211, in main
    module_weights, model, tokenizer = lorahub_learning(lora_module_list=modules,
  File "/data/sdb/sunyuhan/exp/lorahub/lorahub/algorithm.py", line 279, in lorahub_learning
    recommendation = optimizer.minimize(get_score_partial, verbosity=1)
  File "/root/miniconda3/lib/python3.10/site-packages/nevergrad/optimization/base.py", line 671, in minimize
    result = job.result()
  File "/root/miniconda3/lib/python3.10/site-packages/nevergrad/optimization/utils.py", line 137, in result
    self._result = self.func(*self.args, **self.kwargs)
  File "/data/sdb/sunyuhan/exp/lorahub/lorahub/algorithm.py", line 165, in get_score
    loss = get_loss(example_dataset, model, batch_size)
  File "/data/sdb/sunyuhan/exp/lorahub/lorahub/algorithm.py", line 127, in default_get_loss
    for _, batch in enumerate(train_dataloader):
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/data/data_collator.py", line 70, in default_data_collator
    return torch_default_data_collator(features)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/data/data_collator.py", line 136, in torch_default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 81 at dim 2 (got 90)

Solution: set all paddings in algorithm.py to 'max_length':

inputs = tokenizer(
    dataset["input"][i : i + batch_size],
    max_length=2048,
    return_tensors="pt",
  ->padding='max_length',
).to(device)

3. tensor dimension problem

The shape of the input has now become [5, 1, 2048] due to the modifications above:

(Pdb) p batch['input_ids'].shape
torch.Size([5, 1, 2048])

And it causes a shape error:

Traceback (most recent call last):
  File "/data/sdb/sunyuhan/exp/lorahub/example.py", line 239, in <module>
    main()
  File "/data/sdb/sunyuhan/exp/lorahub/example.py", line 211, in main
    module_weights, model, tokenizer = lorahub_learning(lora_module_list=modules,
  File "/data/sdb/sunyuhan/exp/lorahub/lorahub/algorithm.py", line 280, in lorahub_learning
    recommendation = optimizer.minimize(get_score_partial, verbosity=1)
  File "/root/miniconda3/lib/python3.10/site-packages/nevergrad/optimization/base.py", line 671, in minimize
    result = job.result()
  File "/root/miniconda3/lib/python3.10/site-packages/nevergrad/optimization/utils.py", line 137, in result
    self._result = self.func(*self.args, **self.kwargs)
  File "/data/sdb/sunyuhan/exp/lorahub/lorahub/algorithm.py", line 166, in get_score
    loss = get_loss(example_dataset, model, batch_size)
  File "/data/sdb/sunyuhan/exp/lorahub/lorahub/algorithm.py", line 131, in default_get_loss
    outputs = model(**batch)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/peft/peft_model.py", line 1080, in forward
    return self.base_model(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1678, in forward
    encoder_outputs = self.encoder(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 986, in forward
    batch_size, seq_length = input_shape
ValueError: too many values to unpack (expected 2)

Solution: squeeze the unused dimension at algorithm.py:128:

with torch.no_grad():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for _, batch in enumerate(train_dataloader):
      ->batch = {k: v.squeeze(1).to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        train_loss += loss.detach().float()

Summary

All of these bugs may be caused by setting batched=False, but I cannot set batched=True for an unknown reason. After the above three modifications I can successfully run example.py.

Using with CausalLM models

Hi, thanks for the great library! Can the LoRA composition and weight-learning approach be used with LoRA modules trained from AutoModelForCausalLM? If so, do you have any supporting scripts, etc.?

Thanks!
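
As a sketch of the idea with causal LMs (not an official answer): PEFT's add_weighted_adapter can merge several LoRA adapters with fixed weights on an AutoModelForCausalLM base, whereas lorahub's contribution is learning those weights from a few examples. The base model is only an example and the adapter repo IDs are hypothetical placeholders:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"  # example base model
base = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Load two causal-LM LoRA adapters (hypothetical repo IDs).
model = PeftModel.from_pretrained(base, "org/lora-task-a", adapter_name="task_a")
model.load_adapter("org/lora-task-b", adapter_name="task_b")

# Combine the adapters with fixed weights; lorahub would instead learn the weights.
model.add_weighted_adapter(
    adapters=["task_a", "task_b"],
    weights=[0.6, 0.4],
    adapter_name="combined",
    combination_type="linear",
)
model.set_adapter("combined")

inputs = tokenizer("Question: 2 + 2 = ?\nAnswer:", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=5)[0]))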

Training Tasks and Data for these LoRA modules?

Hello, LoRAHub is a very interesting work!
I have the following two questions about the LoRA modules mentioned in the paper:
(1) According to the original paper, the FLAN-2022 Collection used for Flan-T5 instruction tuning consists of as many as 1836 tasks. However, in LoRAHub, it seems that only 196 tasks were used to train the LoRA modules. Can you please clarify if there is any misunderstanding here or if you selected these 196 tasks from the pool of 1836 tasks?
(2) LoRAHub trained each LoRA module using the same tasks as Flan-T5 for fairness. However, it seems that Flan-T5 sampled training examples from different tasks during training and did not use all the samples. Does this mean that even though LoRAHub's LoRA modules use the same tasks as Flan-T5, they may still use samples that Flan-T5 has not seen?

Poor Performance of flan-t5-large Model on MMLU Valid Dataset with Mathematical LoRA

Overview

I have been using the flan-t5-large LoRA module math_dataset_algebra__linear_1d, which is designed for mathematical tasks, to generate results on the MMLU valid dataset. However, the performance of the model on this dataset has been disappointing, and the results obtained are not up to the expected standard. Here is the URL of the LoRA: https://huggingface.co/lorahub/flan_t5_large-math_dataset_algebra__linear_1d

Problem Statement

The main issue I'm facing is that, despite the LoRA being specifically designed for mathematical tasks, its performance on the MMLU valid dataset is not as good as expected.

Steps to Reproduce

  1. Prepare the MMLU valid dataset.
  2. Load and initialize the flan-t5-large model with the LoRA math_dataset_algebra__linear_1d.
  3. Input data from the MMLU valid dataset into the model.
  4. Generate results/output using the model.
  5. Evaluate the performance of the model on the MMLU valid dataset (a rough sketch of these steps is shown after this list).
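
A rough sketch of these steps, assuming the cais/mmlu dataset on Hugging Face and a simple multiple-choice prompt format (the exact prompting and answer parsing used in the experiments above are not known):

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

# Base model plus the math LoRA under test.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model = PeftModel.from_pretrained(
    model, "lorahub/flan_t5_large-math_dataset_algebra__linear_1d"
)

letters = ["A", "B", "C", "D"]
data = load_dataset("cais/mmlu", "abstract_algebra", split="validation")
correct = 0
for ex in data:
    options = "\n".join(f"({l}) {c}" for l, c in zip(letters, ex["choices"]))
    prompt = f"{ex['question']}\nOptions:\n{options}\nAnswer:"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    pred = tokenizer.decode(model.generate(input_ids, max_new_tokens=5)[0], skip_special_tokens=True)
    # The gold answer is an index into choices; accept either the letter or the option text.
    if letters[ex["answer"]] in pred or ex["choices"][ex["answer"]] in pred:
        correct += 1
print("abstract_algebra accuracy:", correct / len(data))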

Expected Behavior

As the LoRA is optimized for mathematical tasks, I expected it to perform well on the MMLU valid dataset. The outputs generated by the model should align with the ground truth from the dataset, demonstrating accurate and meaningful results.

Experiment Result

Contrary to expectations, the flan-t5-large model's performance on the MMLU valid dataset has been disappointing. The generated outputs often contain errors and inaccuracies, making them unsuitable for my intended use case.

original flan_t5

Subject                    Average Accuracy
Abstract Algebra           0.250
College Mathematics        0.280
Elementary Mathematics     0.280
High School Mathematics    0.296
High School Statistics     0.319
Math                       0.289

math_dataset_algebra__linear_1d

Subject                    Average Accuracy
Abstract Algebra           0.240
College Mathematics        0.300
Elementary Mathematics     0.267
High School Mathematics    0.289
High School Statistics     0.329
Math                       0.286
