
t-few's Introduction

T-Few

This repository contains the official code for the paper: "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning".

This method outperforms in-context learning with GPT-3 and achieves state-of-the-art results on "RAFT".

Setup

First, create a virtual environment for the project and install all the requirements. (We use conda to manage environments; be sure to install and initialize conda first.)

  1. Create a virtual environment with Python 3.7: conda create -n tfew python==3.7, then activate it: conda activate tfew.
  2. Install other dependencies. pip install -r requirements.txt -f https://download.pytorch.org/whl/cu113/torch_stable.html
  3. If you plan to run SAID, install its dependencies with python src/intrinsic_said_setup.py develop. Otherwise, skip this step.

The steps above only need to be done once. In addition, every time you start a new session, you will need to run . bin/start.sh

Run your first experiment

Once you have finished setting up the environment, you can try running CUDA_VISIBLE_DEVICES=3 python -m src.pl_train -c t0.json+rte.json -k save_model=False exp_name=first_exp. The outputs of this run will be saved to ${OUTPUT_PATH}/first_exp/, which is usually /t-few/exp_out/first_exp/. Here, first_exp is the experiment name; you can run more experiments with different experiment names. The code will automatically skip finished experiments. (However, if you wish to rerun a finished experiment under the same experiment name, you will need to manually remove the corresponding files in the output directory.)
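Each evaluation writes its metrics as JSON lines to dev_scores.json inside the experiment directory. Here is a minimal sketch for reading them back (the path assumes the default exp_out layout):

    import json

    # Every line of dev_scores.json is one JSON record of evaluation metrics.
    with open("exp_out/first_exp/dev_scores.json") as f:
        for line in f:
            print(json.loads(line))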

There are two ways to control an experiment.

  1. You can specify config files with -c. Multiple config files can be combined with +. (When there are conflicts, config values from the file further to the right take precedence.) This is convenient when you have multiple settings that form a fixed group.
  2. You can override individual values with -k. This is convenient when you need to change a small number of settings. A sketch of the merge semantics follows this list.
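To illustrate the precedence rules, here is a minimal sketch of the right-most-wins merge, assuming flat JSON config files (merge_configs is a hypothetical helper, not a function in this repo):

    import json

    def merge_configs(paths, overrides=None):
        """Combine config files left to right; -k overrides win last."""
        config = {}
        for path in paths:  # e.g. ["t0.json", "rte.json"]
            with open(path) as f:
                config.update(json.load(f))  # later files win on conflicts
        config.update(overrides or {})  # e.g. {"save_model": False}
        return config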

It is recommended to use GPUs with 40GB of memory to train T0(3B) and 80GB to train T0(11B).

Run an array of experiments

In this project, we often need to run a large number of experiments. Here is an example bash script, bin/few-shot-pretrained-3b-100k.sh, which fine-tunes the 3B model with pre-trained (IA)3 on all datasets.

This should take a few hours. Afterwards, you can use scripts/get_results_table.py to generate a CSV summary.
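If you prefer Python to bash, a hedged sketch of such a sweep could look like the following (the dataset list and flags are illustrative; the real script also passes load_weight to start from the pre-trained (IA)3 checkpoint):

    import itertools
    import subprocess

    datasets = ["copa", "rte", "wic"]  # illustrative subset
    seeds = [42, 1024]
    for dataset, seed in itertools.product(datasets, seeds):
        exp_name = f"t03b_{dataset}_seed{seed}_ia3_pretrained100k"
        subprocess.run(
            [
                "python", "-m", "src.pl_train",
                "-c", f"t03b.json+ia3.json+{dataset}.json",
                "-k", f"exp_name={exp_name}",
                f"few_shot_random_seed={seed}", f"seed={seed}",
            ],
            check=True,  # stop the sweep on the first failure
        )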

Citation

If you find this repo helpful, please consider citing our work:

@article{liu2022tfew,
  title={Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning},
  author={Liu, Haokun and Tam, Derek and Muqeeth, Mohammed and Mohta, Jay and Huang, Tenghao and Bansal, Mohit and Raffel, Colin},
  journal={arXiv preprint arXiv:2205.05638},
  year={2022}
}

We use code from the following works:

@article{mahabadi2021compacter,
  title={Compacter: Efficient low-rank hypercomplex adapter layers},
  author={Mahabadi, Rabeeh Karimi and Henderson, James and Ruder, Sebastian},
  journal={arXiv preprint arXiv:2106.04647},
  year={2021}
}

@article{sung2021training,
  title={Training Neural Networks with Fixed Sparse Masks},
  author={Sung, Yi-Lin and Nair, Varun and Raffel, Colin},
  journal={arXiv preprint arXiv:2111.09839},
  year={2021}
}

@article{aghajanyan2020intrinsic,
  title={Intrinsic dimensionality explains the effectiveness of language model fine-tuning},
  author={Aghajanyan, Armen and Zettlemoyer, Luke and Gupta, Sonal},
  journal={arXiv preprint arXiv:2012.13255},
  year={2020}
}

t-few's People

Contributors

craffel, dptam, haokunliu


t-few's Issues

Releasing evaluation log probabilities

Hi, thanks for open-sourcing the model code! Could you release the log probabilities for the evaluation tasks (i.e., the model probabilities for valid answers for each prompt on each question for all evaluated datasets)? This data would allow for fine-grained evaluation of models and comparison against other LLMs.

cf. facebookresearch/metaseq#25

Multi-GPU Support

Hello,

Have you tried training on Multi-GPU setup? I tried running your fine-tuning example like so:

export CUDA_VISIBLE_DEVICES=0,1
python -m src.pl_train -c t03b.json+ia3.json+rte.json -k load_weight="pretrained_checkpoints/t03b_ia3_finish.pt" exp_name=t03b_rte_seed42_ia3_pretrained100k few_shot_random_seed=42 seed=42

But I get errors in the Lightning data loaders.

Any ideas?
Thank you

Where are the performance results of experiments stored?

Hi,

thank you very much for sharing your code!

I ran the example from the README and parts of the few-shot-pretrained-3b-100k.sh script. However, the dev_scores.json for the README example contains only the line:

{"accuracy": 0.6101083032490975, "score_gt": 0.3983679488032303, "score_cand": 0.6958685107394676}

And for t03b_copa_seed42_ia3_pretrained100k (the first experiment of few-shot-pretrained-3b-100k.sh):

{"accuracy": 0.85, "score_gt": 0.06061243396921782, "score_cand": 0.4640417302213609}

Those are just the results of the "Validation sanity check" right at the beginning, right? So I am wondering where the validation results after each epoch are stored, or whether I am missing something here?

Thanks!

Could you please give a detailed explanation of the "rank classification"?

Hi, thanks for your excellent work. I've read the paper and reviewed the code, and I've encountered some issues, outlined below:

  1. I'd appreciate a detailed explanation of how the "rank classification" is implemented. Could you please provide clarification on the code found at this link?

  2. I'm curious about how the "rank classification" process influences the final results. Is it feasible to employ a direct generation approach, such as generating the label words and matching them against the true answer, as an alternative method?
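For readers with the same question: the paper describes rank classification as scoring each answer choice by its log-probability under the model (optionally length-normalized) and predicting the highest-scoring choice. A minimal sketch under that reading, not the repo's actual implementation:

    import torch

    def rank_classify(choice_logprobs, choice_lengths, length_norm=True):
        """choice_logprobs: one 1-D tensor of token log-probs per answer
        choice; choice_lengths: token count of each choice."""
        scores = torch.stack([lp.sum() for lp in choice_logprobs])
        if length_norm:
            scores = scores / choice_lengths  # mean log-prob per token
        return scores.argmax().item()  # index of the predicted choice

    # Example: the second choice wins after length normalization.
    pred = rank_classify(
        [torch.tensor([-1.0, -1.0, -1.0]), torch.tensor([-0.5, -0.6])],
        torch.tensor([3.0, 2.0]),
    )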

AttributeError: Can't pickle local object 'create_collate_fn.<locals>.collate_fn'

When I tried to run the demo, I ran into this error! @dptam @jmohta @muqeeth

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
WARNING:datasets.builder:Reusing dataset super_glue (/Users/caffrey/Documents/research/t-few-master/cache/super_glue/rte/1.0.2/d040c658e2ddef6934fdd97deb45c777b6ff50c524781ea434e7219b56a428a7)
Missing logger folder: exp_out/first_exp/log
WARNING:datasets.builder:Reusing dataset super_glue (/Users/caffrey/Documents/research/t-few-master/cache/super_glue/rte/1.0.2/d040c658e2ddef6934fdd97deb45c777b6ff50c524781ea434e7219b56a428a7)
Train size 32
Eval size 277

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 2.8 B 
-----------------------------------------------------
2.8 B     Trainable params
0         Non-trainable params
2.8 B     Total params
11,399.029 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/caffrey/Documents/paper/t-few-master/src/pl_train.py", line 86, in <module>
    main(config)
  File "/Users/caffrey/Documents/paper/t-few-master/src/pl_train.py", line 57, in main
    trainer.fit(model, datamodule)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
    self._run_sanity_check()
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
    val_loop.run()
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 199, in run
    self.on_run_start(*args, **kwargs)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 88, in on_run_start
    self._data_fetcher = iter(data_fetcher)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 178, in __iter__
    self.dataloader_iter = iter(self.dataloader)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 443, in __iter__
    return self._get_iterator()
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 389, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1062, in __init__
    w.start()
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/caffrey/miniforge3/envs/tongji/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'create_collate_fn.<locals>.collate_fn'
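For anyone hitting the same error: spawn-based multiprocessing (the default on macOS) cannot pickle a function defined inside another function. A hedged workaround sketch, not the repo's official fix, is to set num_workers=0 or to use a picklable top-level callable as the collate function:

    from torch.utils.data import DataLoader

    class CollateFn:
        """Top-level callable, so it is picklable, unlike a nested collate_fn."""

        def __init__(self, tokenizer):
            self.tokenizer = tokenizer

        def __call__(self, batch):
            return batch  # real logic would tokenize and pad the batch here

    # loader = DataLoader(dataset, num_workers=8, collate_fn=CollateFn(tokenizer))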

Where are the loss function changes in the codebase?

From the paper: "As an objective, we use the sum of a standard language modeling loss, an unlikelihood loss for incorrect choices, and a length-normalized loss."

And the code also uses huggingface transformers. I was just wondering if you could point me to where the loss function is modified and then used in training in the codebase.
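While waiting for a pointer to the exact location, here is a minimal sketch of the three-part objective as the paper describes it (an illustration of the idea only; tfew_loss and its inputs are hypothetical, not the repo's code):

    import torch
    import torch.nn.functional as F

    def tfew_loss(cand_token_logprobs, cand_lengths, correct_idx):
        """cand_token_logprobs: one 1-D tensor of token log-probs per
        candidate answer; assumes at least one incorrect candidate."""
        # 1) Standard LM loss on the correct choice.
        lm_loss = -cand_token_logprobs[correct_idx].sum()
        # 2) Unlikelihood loss: push down the tokens of incorrect choices.
        ul_terms = [
            -torch.log1p(-lp.exp().clamp(max=1 - 1e-6)).mean()
            for i, lp in enumerate(cand_token_logprobs) if i != correct_idx
        ]
        ul_loss = torch.stack(ul_terms).mean()
        # 3) Length-normalized loss: cross entropy over per-token mean scores.
        norm_scores = torch.stack(
            [lp.sum() for lp in cand_token_logprobs]
        ) / cand_lengths
        ln_loss = F.cross_entropy(
            norm_scores.unsqueeze(0), torch.tensor([correct_idx])
        )
        return lm_loss + ul_loss + ln_loss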

save dev_pred.txt and test_pred.txt for RTE and ANLI

Congrats on your great work! I am interested in analyzing the predictions of T0-3B + IA3 on NLI tasks. I ran the command python -m src.pl_train -c t03b.json+anli-r3.json+ia3.json -k exp_name=anli-r3 load_weight="pretrained_checkpoints/t03b_ia3_finish.pt" eval_epoch_interval=20 but only see the dev_scores.json file in the output. How can I also obtain the model's prediction files? Thanks!

IA3 implementation doesn't add parameters for feedforward layers

Hi,

I'm trying to implement your method (IA)3 for use with HuggingFace's PEFT library and had a question. In the paper, it is mentioned that the learned vectors in (IA)3 are added for all the position-wise feedforward layers in the transformer, along with the various attention layers. I ran src/models/lora.py and used the config parameters in configs/ia3.json to check what the new model layers would look like. The typical FeedForward module in T5 is a T5DenseActDense module that looks as follows:

    (DenseReluDense): T5DenseActDense(
      (wi): Linear(in_features=768, out_features=3072, bias=False)
      (wo): Linear(in_features=3072, out_features=768, bias=False)
      (dropout): Dropout(p=0.1, inplace=False)
      (act): ReLU()
    )

Since (IA)3 is implemented as an extension of LoRA, the Linear layers are supposed to get converted into LoRALinear layers. However, the config in ia3.json sets the parameter lora_layers to "k|v|wi_1.*", which does not include the layers in DenseReluDense (these are attributes named wi and wo). I've tried T5-small, T5-base and T5-3B, and for all of these models, learned vectors are not added for the feedforward layers. I was wondering if I'm doing something wrong, if I'm supposed to use a different config file, or if (IA)3 parameters are added only for certain feedforward layers?
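One likely explanation, offered as an assumption: T0 is based on the LM-adapted T5 v1.1, whose gated feedforward block has layers named wi_0 and wi_1, whereas the original T5 checkpoints (T5-small/base/3b) use a single wi, which the pattern does not match. A quick check of what the pattern matches:

    import re

    pattern = "k|v|wi_1.*"  # from configs/ia3.json
    for name in ["q", "k", "v", "o", "wi", "wi_0", "wi_1", "wo"]:
        print(name, bool(re.fullmatch(pattern, name)))
    # Only k, v, and wi_1 match; plain-T5's wi and wo layers are skipped.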

ImportError: cannot import name 'fast_walsh_hadamard_transform' from 'src.models.fwh_cuda' (unknown location)

I tried running the example from the README and got this error. Can you help?

$ CUDA_VISIBLE_DEVICES=3 python -m src.pl_train -c t0.json+rte.json -k save_model=False exp_name=first_exp
Traceback (most recent call last):
  File "/home/james/.conda/envs/tfew/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/james/.conda/envs/tfew/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/james/github/t-few/src/pl_train.py", line 10, in <module>
    from src.models.EncoderDecoder import EncoderDecoder
  File "/home/james/github/t-few/src/models/EncoderDecoder.py", line 11, in <module>
    from .intrinsic import intrinsic_plugin_on_step
  File "/home/james/github/t-few/src/models/intrinsic.py", line 10, in <module>
    from .fwh_cuda import fast_walsh_hadamard_transform as fast_walsh_hadamard_transform_cuda
ImportError: cannot import name 'fast_walsh_hadamard_transform' from 'src.models.fwh_cuda' (unknown location)

Clarification about IA^3

Hi :)

I was reading your interesting paper https://arxiv.org/pdf/2205.05638.pdf.

In Section 3.3, you specify that IA^3 adds a total of d_k + d_v + d_ff parameters.

However, if I look at this line, you seem to be allocating two d-dimensional vectors for each linear layer (multi_lora_a, multi_lora_b), multiplying multi_lora_a with the input and multi_lora_b with the transformed input.

hidden = hidden * self.multi_lora_b.flatten()

Am I missing something?

Thank you for your clarification :-)

Does pl_train.py support TPU training?

Hello,
I am interested in using the T-Few recipe for some experiments with Google Cloud TPUs. I am wondering whether the pl_train.py script already supports TPUs? In the Acknowledgments section of the paper, the authors mention that Cloud TPUs were used; however, in this script I can see that only GPU is directly supported. Any pointers would be appreciated; in particular, I would like to use T0-11B.

To which epoch/training step does the finish.pt checkpoint belong?

Hi everyone!

When I run the experiments, the model is validated after every eval_epoch_interval and a checkpoint is written out as global_stepXXXXX.pt. At the end there is also a final checkpoint written out, named finish.pt. I assumed this one belongs either to the best intermediate validation performance or to the last epoch. However, comparing it with the other checkpoints that were created, it seems that finish.pt differs from all the global_stepXXXXX.pt checkpoints, so I am wondering which point in training finish.pt corresponds to.

Sorry if I am missing something obvious here.

Best,
Stefan

Multi-task batching

In the paper, you mention that IA^3 is compatible with multi-task batching, a requirement for being comparable to ICL. Unfortunately, the current implementation of Huggingface PEFT does not support this, and it would apparently require a big refactoring to do so (huggingface/peft#759).

Do you know of an implementation or example that shows how to do this?
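For reference, the reason (IA)3 permits mixed-task batches is that it only rescales activations elementwise, so each example can use its own learned vector. A toy sketch of that idea (shapes and names are illustrative, not any library's API):

    import torch

    num_tasks, d = 3, 8
    l_k = torch.nn.Parameter(torch.ones(num_tasks, d))  # one vector per task
    keys = torch.randn(5, 4, d)                 # (batch, seq, d) activations
    task_ids = torch.tensor([0, 2, 1, 0, 2])    # task of each batch example
    scaled = keys * l_k[task_ids].unsqueeze(1)  # per-example rescaling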

KeyError: 'HF_HOME'

Hi!
I was trying to run the example in the README, but it fails with KeyError: 'HF_HOME'.
This is the command I used: python -m src.pl_train -c t03b.json+rte.json -k save_model=False exp_name=first_exp
I can't find anything in the code that sets the value of this environment variable.

Mark experiment first_exp as claimed
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Traceback (most recent call last):
  File "/Users/weiqiuyou/opt/miniconda3/envs/tfew/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/weiqiuyou/opt/miniconda3/envs/tfew/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/weiqiuyou/Documents/codebases/t-few/src/pl_train.py", line 86, in <module>
    main(config)
  File "/Users/weiqiuyou/Documents/codebases/t-few/src/pl_train.py", line 57, in main
    trainer.fit(model, datamodule)
  File "/Users/weiqiuyou/opt/miniconda3/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/Users/weiqiuyou/opt/miniconda3/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/Users/weiqiuyou/opt/miniconda3/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/Users/weiqiuyou/opt/miniconda3/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1131, in _run
    self._data_connector.prepare_data()
  File "/Users/weiqiuyou/opt/miniconda3/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 154, in prepare_data
    self.trainer.datamodule.prepare_data()
  File "/Users/weiqiuyou/opt/miniconda3/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py", line 474, in wrapped_fn
    fn(*args, **kwargs)
  File "/Users/weiqiuyou/Documents/codebases/t-few/src/data/data_module.py", line 17, in prepare_data
    _ = self.dataset_reader.read_few_shot_dataset()
  File "/Users/weiqiuyou/Documents/codebases/t-few/src/data/dataset_readers.py", line 164, in read_few_shot_dataset
    orig_data = self.read_orig_dataset("train")
  File "/Users/weiqiuyou/Documents/codebases/t-few/src/data/dataset_readers.py", line 146, in read_orig_dataset
    orig_data = load_dataset(*self.dataset_stash, split=split, cache_dir=os.environ["HF_HOME"])
  File "/Users/weiqiuyou/opt/miniconda3/envs/tfew/lib/python3.7/os.py", line 678, in __getitem__
    raise KeyError(key) from None
KeyError: 'HF_HOME'
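As the traceback shows, src/data/dataset_readers.py reads os.environ["HF_HOME"] directly, so one workaround (a sketch; the cache path is just an example) is to define the variable before src.pl_train runs:

    import os

    # Point HF_HOME at any writable cache directory before training starts.
    os.environ.setdefault(
        "HF_HOME", os.path.expanduser("~/.cache/huggingface")
    )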

AttributeError: 'DistributedDataParallel' object has no attribute 'save_checkpoint'

@HaokunLiu @dptam Thank you for your great work and congrats on the neurips acceptance!

I ran into the following issue when using DDP:
AttributeError: 'DistributedDataParallel' object has no attribute 'save_checkpoint'

It's raised by the following line:

self.trainer.model.save_checkpoint(distributed_save_path)

Any suggestion would be appreciated!

Another related question: why does the DDP checkpoint also need to be processed by zero_to_fp32.get_fp32_state_dict_from_zero_checkpoint(distributed_save_path)? I thought that should apply only to DeepSpeed ZeRO checkpoints. This is done in:

trainable_states = zero_to_fp32.get_fp32_state_dict_from_zero_checkpoint(distributed_save_path)

Can't run 11 billion model on A100 with 80GB

Hi @craffel @muqeeth @HaokunLiu,

We're trying to reproduce the T-Few results for a paper, but we're getting 'CUDA out of memory' on an A100 with 80GB (your recommended setup).

This is what we're running:

python -m src.pl_train -c t011b.json+ia3.json+rte.json -k load_weight="pretrained_checkpoints/t011b_ia3_finish.pt" exp_name=t011b_rte_seed42_ia3_pretrained few_shot_random_seed=42 seed=42

We installed according to the README instructions and are using the default settings in the config files.
We are able to run the 3 billion model using the command above, just not the 11 billion.
Is there anything we are doing wrong?

This is the exception:

CUDA out of memory

Thank you

question about intrinsic.py

Some context: in line 179 of the code, we have param.requires_grad_(False). I'm a bit confused about why this needs to be set to False. When I try to reproduce this code in a different setting, my loss does not decrease; however, with param.requires_grad_(True), the loss does decrease. Either way, I'm unclear why it should matter, because the optimizer only updates intrinsic_parameter and intrinsic_said.

questions from your paper

Thanks for your great work. I have one question about your paper: Table 4 shows the results for all PEFT methods "without" pre-training, right?

How is l_ff created?

Firstly, thank you for the amazing work! I had a question about the implementation of $l_{ff}$ in the (IA)3 method:

The config file for (IA)3 lists lora_layers as "k|v|wi_1.*"

"lora_layers": "k|v|wi_1.*",

However, when using this string to find model layers to modify (code snippet below), it seems that while the keys and values in the self-attention modules are modified, all the FF layers (i.e., those named like encoder.block.x.layer.x.DenseReluDense.wi) are skipped, and thus the vector $l_{ff}$ is not created in the model ($l_k$ and $l_v$ are created as expected).

t-few/src/models/lora.py, lines 64 to 72 at 4e581fa:

    if re.fullmatch(config.lora_layers, c_name):
        assert isinstance(
            layer, nn.Linear
        ), f"LoRA can only be applied to torch.nn.Linear, but {layer} is {type(layer)}."
        setattr(
            module,
            c_name,
            LoRALinear(layer, config.lora_rank, config.lora_scaling_rank, config.lora_init_scale),
        )

I was thus wondering if the param lora_layers should instead be "k|v|wi.*"? Or am I missing something, and the existing config file somehow also triggers the creation of $l_{ff}$, in addition to $l_k$ and $l_v$?

Thank you!

Sum of logprobs in the probability space adds up to values above 1

Hi!
Congratulations on this great work, and thank you for putting up such an easy-to-use framework! It definitely facilitates research quite a bit :)

I was trying to interpret the scores logged during evaluation on the development set, and I realized that when I convert the GT and CAND scores to probability space (that is, np.exp(-1 * logprob)), their sum is sometimes bigger than 1 for two-class datasets (like RTE). Maybe I'm interpreting these scores wrongly, since I expected the sum of the probabilities to be less than or equal to 1 for two-class datasets.

Could you let me know if my rationale is flawed, and if not, why the sum of the probabilities can be above 1?

Thank you in advance!
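One plausible reading, offered as an assumption rather than a confirmed answer: each score is an independent sequence log-likelihood under the full-vocabulary softmax, so the two candidates' probabilities do not form a normalized two-way distribution and need not sum to at most 1. For instance, with the RTE dev scores quoted in an earlier issue above:

    import numpy as np

    score_gt, score_cand = 0.3983679488032303, 0.6958685107394676
    p_gt, p_cand = np.exp(-score_gt), np.exp(-score_cand)
    print(p_gt, p_cand, p_gt + p_cand)  # ~0.67, ~0.50; the sum is ~1.17 > 1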

Validation score on WSC decreases with training

Thank you for the amazing work on t-few! I've noticed strange behavior when running SuperGLUE's WSC. I've been logging the validation score every 40 epochs using self.eval_epoch_interval = 40, and when running the command
python -m src.pl_train -c ia3.json+wsc.json -k save_model=False exp_name=first_exp the output is as follows:

{"accuracy": 0.6730769230769231, "score_gt": 0.5068197436630726, "score_cand": 0.7191649047801127}
{"accuracy": 0.49038461538461536, "score_gt": 1.4563168384707892, "score_cand": 1.505529030584372}
{"accuracy": 0.47115384615384615, "score_gt": 3.4743554890155792, "score_cand": 2.727144861450562}
{"accuracy": 0.46153846153846156, "score_gt": 4.202766236777489, "score_cand": 3.5702959763316007}
{"accuracy": 0.40384615384615385, "score_gt": 5.157541000499175, "score_cand": 3.5657502871293287}
{"accuracy": 0.3942307692307692, "score_gt": 5.397989429533482, "score_cand": 3.975659689651086}
{"accuracy": 0.40384615384615385, "score_gt": 5.073869264469697, "score_cand": 3.995581218542961}

The last accuracy score is reported at 240 epochs out of a total of 250 epochs.

Any ideas on what is going on here? Thanks!

results for LoRA

Thank you for your valuable contributions. I am currently attempting to replicate the outcomes presented in your paper, but I am having difficulty obtaining the expected results when re-running the LoRA adapters.

copa: 76.00 (2.00), h-swag: 26.64 (0.36), storycloze: 84.87 (0.21), winogrande: 51.14 (2.13), wsc: 65.38 (2.88), wic: 51.57 (0.63), rte: 59.57 (0.36), cb: 51.79 (1.79), anli-r1: 34.80 (0.80), anli-r2: 34.00 (2.40), anli-r3: 32.92 (1.08)

Have you encountered situations where training on "h-swag" and "rte" did not yield successful results?

Issue installing deepspeed and running the first experiment on Windows

Hello, thanks for your great contribution! I am trying to install this on Windows.
In the tfew Anaconda environment, deepspeed==0.5.10 is hard to install on Windows, so I used deepspeed==0.3.16, which installed successfully.

However, when I run my first experiment with the commands below, I get an error. Can you help me?

(tfew) C:\Users\78166\t-few>set HF_HOME=%USERPROFILE%\.cache\huggingface

(tfew) C:\Users\78166\t-few>set CUDA_VISIBLE_DEVICES=0

(tfew) C:\Users\78166\t-few>python -m src.pl_train -c t03b.json+rte.json -k save_model=False exp_name=first_exp
Start experiment first_exp
{
"exp_dir": "exp_out\first_exp",
"exp_name": "first_exp",
"allow_skip_exp": true,
"seed": 42,
"model": "EncDec",
"max_seq_len": 256,
"origin_model": "bigscience/T0_3B",
"load_weight": "",
"dataset": "rte",
"few_shot": true,
"num_shot": 32,
"few_shot_random_seed": 100,
"train_template_idx": -1,
"eval_template_idx": -1,
"batch_size": 8,
"eval_batch_size": 16,
"num_workers": 8,
"change_hswag_templates": false,
"raft_cross_validation": true,
"raft_validation_start": 0,
"raft_labels_in_input_string": "comma",
"cleaned_answer_choices_b77": false,
"compute_precision": "bf16",
"compute_strategy": "none",
"num_steps": 300,
"eval_epoch_interval": 10000,
"eval_before_training": true,
"save_model": false,
"save_step_interval": 20000,
"mc_loss": 1,
"unlikely_loss": 1,
"length_norm": 1,
"grad_accum_factor": 1,
"split_option_at_inference": false,
"optimizer": "adafactor",
"lr": 0.0003,
"trainable_param_names": ".",
"scheduler": "linear_decay_with_warmup",
"warmup_ratio": 0.06,
"weight_decay": 0,
"scale_parameter": true,
"grad_clip_norm": 1,
"model_modifier": "",
"prompt_tuning_num_prefix_emb": 100,
"prompt_tuning_encoder": true,
"prompt_tuning_decoder": true,
"lora_rank": 4,
"lora_scaling_rank": 0,
"lora_init_scale": 0.01,
"lora_modules": "none",
"lora_layers": "none",
"bitfit_modules": ".",
"bitfit_layers": "q|k|v|o|wi_[01]|w_o",
"adapter_type": "normal",
"adapter_non_linearity": "relu",
"adapter_reduction_factor": 4,
"normal_adapter_residual": true,
"lowrank_adapter_w_init": "glorot-uniform",
"lowrank_adapter_rank": 1,
"compacter_hypercomplex_division": 8,
"compacter_learn_phm": true,
"compacter_hypercomplex_nonlinearity": "glorot-uniform",
"compacter_shared_phm_rule": false,
"compacter_factorized_phm": false,
"compacter_shared_W_phm": false,
"compacter_factorized_phm_rule": false,
"compacter_phm_c_init": "normal",
"compacter_phm_rank": 1,
"compacter_phm_init_range": 0.01,
"compacter_kronecker_prod": false,
"compacter_add_compacter_in_self_attention": false,
"compacter_add_compacter_in_cross_attention": false,
"intrinsic_projection": "fastfood",
"intrinsic_said": true,
"intrinsic_dim": 2000,
"intrinsic_device": "cpu",
"fishmask_mode": null,
"fishmask_path": null,
"fishmask_keep_ratio": 0.05,
"prefix_tuning_num_input_tokens": 10,
"prefix_tuning_num_target_tokens": 10,
"prefix_tuning_init_path": null,
"prefix_tuning_init_text": null,
"prefix_tuning_parameterization": "mlp-512",
"train_pred_file": "exp_out\first_exp\train_pred.txt",
"dev_pred_file": "exp_out\first_exp\dev_pred.txt",
"dev_score_file": "exp_out\first_exp\dev_scores.json",
"test_pred_file": "exp_out\first_exp\test_pred.txt",
"test_score_file": "exp_out\first_exp\test_scores.json",
"finish_flag_file": "exp_out\first_exp\exp_completed.txt"
}
Mark experiment first_exp as claimed
Traceback (most recent call last):
  File "C:\Users\78166\anaconda3\envs\tfew\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\78166\t-few\src\pl_train.py", line 86, in <module>
    main(config)
  File "C:\Users\78166\t-few\src\pl_train.py", line 33, in main
    tokenizer, model = get_transformer(config)
  File "C:\Users\78166\t-few\src\pl_train.py", line 17, in get_transformer
    tokenizer = AutoTokenizer.from_pretrained(config.origin_model)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 481, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 350, in get_tokenizer_config
    use_auth_token=use_auth_token,
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\transformers\file_utils.py", line 1784, in cached_path
    local_files_only=local_files_only,
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\transformers\file_utils.py", line 1947, in get_from_cache
    r = requests.head(url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\requests\api.py", line 100, in head
    return request("head", url, **kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\requests\adapters.py", line 497, in send
    chunked=chunked,
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\connectionpool.py", line 696, in urlopen
    self._prepare_proxy(conn)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\connectionpool.py", line 964, in _prepare_proxy
    conn.connect()
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\connection.py", line 359, in connect
    conn = self._connect_tls_proxy(hostname, conn)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\connection.py", line 506, in _connect_tls_proxy
    ssl_context=ssl_context,
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\util\ssl_.py", line 453, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\util\ssl_.py", line 495, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\ssl.py", line 412, in wrap_socket
    session=session
  File "C:\Users\78166\anaconda3\envs\tfew\lib\ssl.py", line 807, in _create
    raise ValueError("check_hostname requires server_hostname")
ValueError: check_hostname requires server_hostname

Missing config.split_option_flag?

Hi, thanks for the code!

When I run:

CUDA_VISIBLE_DEVICES=0 python -m src.pl_train -c t03b.json+rte.json -k save_model=False exp_name=first_exp3

I get:

Reusing dataset super_glue (/localdata/hjl/hf/super_glue/rte/1.0.2/d040c658e2ddef6934fdd97deb45c777b6ff50c524781ea434e7219b56a428a7)
Train size 32
Eval size 277
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Missing logger folder: /home/hjl/t-few/exp_out/first_exp3/log

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 2.8 B
-----------------------------------------------------
2.8 B     Trainable params
0         Non-trainable params
2.8 B     Total params
11,399.029Total estimated model params size (MB)
Validation sanity check:   0%|          | 0/18 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hjl/t-few/src/pl_train.py", line 86, in <module>
    main(config)
  File "/home/hjl/t-few/src/pl_train.py", line 57, in main
    trainer.fit(model, datamodule)
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1375, in _run_sanity_check
    self._evaluation_loop.run()
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
    output = self.trainer.accelerator.validation_step(step_kwargs)
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 236, in validation_step
    return self.training_type_plugin.validation_step(*step_kwargs.values())
  File "/opt/conda/hjl/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 219, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/home/hjl/t-few/src/models/EncoderDecoder.py", line 229, in validation_step
    batch_output = self.predict(batch)
  File "/home/hjl/t-few/src/models/EncoderDecoder.py", line 139, in predict
    if not self.config.split_option_flag:
AttributeError: 'Config' object has no attribute 'split_option_flag'

I can't find a reference to split_option_flag in any of the config files.
Should I manually set it?

Thanks!
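Until the flag is added to the shipped configs, a defensive workaround sketch (not an official fix) is to fall back when the attribute is missing, mirroring the "split_option_at_inference": false default shown in the config dump above:

    class Config:  # stand-in for the repo's Config object
        split_option_at_inference = False

    config = Config()
    # Hypothetical guard for EncoderDecoder.predict(): fall back instead of
    # raising AttributeError when split_option_flag is absent.
    split_option = getattr(
        config, "split_option_flag", config.split_option_at_inference
    )
    print(split_option)  # False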
