
temp_xalign

Temporary repo for sharing code with Manish Sir.

Tip: All file paths below are relative to role_specific_finetuner/. Make sure you cd into it before running the scripts.

Setting up environment and data

  1. Set up the conda environment
conda create --name xalign_role python==3.7
conda activate xalign_role
pip install -r requirements.txt
cd transformers
pip install .
cd ..
  2. Set up the data

2.1 Multilingual training experiment

wiki pretraining data: https://drive.google.com/file/d/1zvPuzPntsRtMisPfanAG_qH6oePKq_M0/view?usp=sharing
xalign finetuning data: https://drive.google.com/file/d/1xuS8zvq4k2F6Uxfx6XF6R3ueKNNFodjl/view?usp=sharing

Alternatively, you can use the following gdown commands to download the archives directly and unzip them into a newly created ./data directory. The final directory structure should look like ./data/wiki and ./data/xalign_unified_script.

mkdir data
cd data

gdown 1zvPuzPntsRtMisPfanAG_qH6oePKq_M0 
unzip wiki.zip

gdown 1xuS8zvq4k2F6Uxfx6XF6R3ueKNNFodjl
unzip xalign_unified_script.zip

cd ..

2.2 English-only training experiment

wiki pretraining data: https://drive.google.com/file/d/1JKPzlkfusE0rkpm3q0CBRc6cwSk2CzzD/view?usp=sharing
xalign finetuning data: https://drive.google.com/file/d/1iEaYO0dE7owvX3m5suHEPZVctDpD5RQq/view?usp=sharing

Alternatively, you can use the following gdown commands to download the archives directly and unzip them into a newly created ./data directory. The final directory structure should look like ./data/wiki_only_english and ./data/xalign_only_english.

mkdir data
cd data

gdown 1JKPzlkfusE0rkpm3q0CBRc6cwSk2CzzD
unzip wiki_only_english.zip

gdown 1iEaYO0dE7owvX3m5suHEPZVctDpD5RQq
unzip xalign_only_english.zip

cd ..

2.3 Multilingual pretraining experiment (increased data size)

wiki pretraining data: https://drive.google.com/file/d/1HTyF-IgeG1J-I-1Xn_gsRhKYEzDV31wy/view?usp=sharing
xalign finetuning data: https://drive.google.com/file/d/1FR9Dpab5Z-hF0BM-IO1d-9MnElCNApaK/view?usp=sharing

Alternatively, you can use the following gdown commands to download the archives directly and unzip them into a newly created ./data directory. The final directory structure should look like ./data/wiki_unified_script and ./data/xalign_unified_script.

mkdir data
cd data

gdown 1HTyF-IgeG1J-I-1Xn_gsRhKYEzDV31wy
unzip wiki_unified_script.zip

gdown 1FR9Dpab5Z-hF0BM-IO1d-9MnElCNApaK
unzip xalign_unified_script.zip

cd ..

2.4 Multilingual finetuning experiment (with ordered facts)

xalign finetuning data: https://drive.google.com/file/d/1FR9Dpab5Z-hF0BM-IO1d-9MnElCNApaK/view?usp=sharing

2.5 SWFT multilingual pretraining and then XAlign finetuning

Before running this experiment, make sure you have a standard, recent installation of the transformers library rather than the custom edited version. To do so, uninstall the existing transformers library and reinstall it with pip install transformers.
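
For reference, a minimal sketch of the reinstall (assuming the xalign_role conda environment from step 1 is active):

pip uninstall -y transformers
pip install transformers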

pretraining data: https://drive.google.com/file/d/18FdyPIR86wD8hvUYC-sU_BEO4WZeurXv/view?usp=sharing
finetuning data: https://drive.google.com/file/d/1ancr8wa8gIWKrmOybi2UahAteaqFst_C/view?usp=sharing

Alternatively, you can use the following gdown commands to download the archives directly and unzip them into a newly created ./data directory. The final directory structure should look like ./data/wiki_pretraining and ./data/xalign_HRT_only.

mkdir data
cd data

gdown 18FdyPIR86wD8hvUYC-sU_BEO4WZeurXv
unzip wiki_pretraining.zip

gdown 1ancr8wa8gIWKrmOybi2UahAteaqFst_C
unzip xalign_HRT_only.zip

cd ..

Make sure you are on the master branch. Then run run_both.sh to run pretraining and finetuning sequentially. Set log_dir and the corresponding checkpoint_path correctly, and set the other hyperparameters as well before running.
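
A minimal sketch of the launch; the values below are placeholders, and the exact variable names inside run_both.sh may differ:

# inside run_both.sh, set (placeholder values):
#   log_dir=logs/swft_multi_pretrain
#   checkpoint_path=<checkpoint written under log_dir by the pretraining stage>
bash run_both.sh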

2.6 KGPT multilingual pretraining and then XAlign finetuning

Before running this experiment, make sure you have a standard, recent installation of the transformers library rather than the custom edited version. To do so, uninstall the existing transformers library and reinstall it with pip install transformers.

pretraining data:
finetuning data:

Alternatively, you can use the following gdown commands to download the archives directly and unzip them into a newly created ./data directory. The final directory structure should look like ./data/wiki_pretraining and ./data/xalign_HRT_only.

mkdir data
cd data

gdown 
unzip wiki_pretraining.zip

gdown 
unzip xalign_finetuning.zip

cd ..

Then run run_pretrain.sh with the proper hyperparameters edited in it. Next, run run_finetune.sh with its checkpoint_path argument set to the checkpoint produced by the pretraining run. Set the other hyperparameters as well before running.
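
A minimal sketch of the two-step run; the checkpoint path is a placeholder for whatever run_pretrain.sh actually writes:

# 1) pretraining (edit the hyperparameters inside run_pretrain.sh first)
bash run_pretrain.sh

# 2) finetuning from the pretrained checkpoint: inside run_finetune.sh, set
#    checkpoint_path=<log_dir>/checkpoints/last.ckpt   (placeholder path)
bash run_finetune.sh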

2.7 SWFT only_en pretraining and then XAlign finetuning

Before running this experiment, make sure you have a standard, recent installation of the transformers library rather than the custom edited version. To do so, uninstall the existing transformers library and reinstall it with pip install transformers.

pretraining data: https://drive.google.com/file/d/13Yb7eGuZ3-C3JdWQnFC0UepwQXbfGgJh/view?usp=sharing
finetuning data: https://drive.google.com/file/d/1ancr8wa8gIWKrmOybi2UahAteaqFst_C/view?usp=sharing

Alternatively, you can use the following gdown commands to download the archives directly and unzip them into a newly created ./data directory. The final directory structure should look like ./data/wiki_pretraining and ./data/xalign_HRT_only.

mkdir data
cd data

gdown 13Yb7eGuZ3-C3JdWQnFC0UepwQXbfGgJh
unzip wiki_pretraining.zip

gdown 1ancr8wa8gIWKrmOybi2UahAteaqFst_C
unzip xalign_HRT_only.zip

cd ..

Then run run_pretrain.sh with the proper hyperparameters edited in it. Next, run run_finetune.sh with its checkpoint_path argument set to the checkpoint produced by the pretraining run. Set the other hyperparameters as well before running.

2.8 XAlign finetuning with enhanced encoding strategy (role_specific encoding)

Before running this experiment, make sure you have the custom edited installation of transformers rather than the standard version. To do so, first switch to the role_ids branch, then uninstall the existing transformers library and install it again from the bundled transformers/ directory: cd transformers; pip install .
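
For reference, a minimal sketch of the custom install, run from the repository root (branch and directory names are taken from the note above):

git checkout role_ids
pip uninstall -y transformers
cd transformers
pip install .
cd ..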

finetuning data: https://drive.google.com/file/d/1JdtYxJlAYp7UGaD9mkY32ggehlPYcjJ4/view?usp=sharing

Then run run_finetuning.sh with the proper hyperparameters edited in it. Set the checkpoint path to None.

2.9 XAlign finetuning with copy_mechanism and role_specific encoding combined

Before running this experiment, make sure you have the custom edited installation of transformers rather than the standard version. To do so, first switch to the role_ids_copy branch, then uninstall the existing transformers library and install it again from the bundled transformers/ directory: cd transformers; pip install .

finetuning data: https://drive.google.com/file/d/1R-PjPp2ylKKWkkoLxbJz3c_vxszBqsTv/view?usp=sharing

Then run run_finetuning.sh with the proper hyperparameters edited in it. Set the checkpoint path to None.

2.10 finetuning_HRTQRQT_multi_pretrained_ckpt_on_only_HRT_dataset

Git clone the repo and stay on the main branch. Before running this experiment, make sure you have a standard, recent installation of the transformers library rather than the custom edited version. To do so, uninstall the existing transformers library and reinstall it with pip install transformers.

Download and unzip the data into the data/ directory using this link: https://drive.google.com/file/d/1ancr8wa8gIWKrmOybi2UahAteaqFst_C/view?usp=sharing

Then run run_finetune.sh with the appropriate parameters in it (epochs: 30, checkpoint_path: 'wandb'). Please keep checkpoint_path set to 'wandb'; the exact cloud location is hardcoded inside the script, so there is no need to tweak it.
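
A minimal sketch of the data setup and launch; the gdown ID comes from the link above, and the archive name xalign_HRT_only.zip matches the one used for the same file in experiment 2.5:

mkdir data
cd data
gdown 1ancr8wa8gIWKrmOybi2UahAteaqFst_C
unzip xalign_HRT_only.zip
cd ..

# edit run_finetune.sh: epochs=30, checkpoint_path='wandb'
bash run_finetune.sh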

2.11 finetuning_HRTQRQT_english_pretrained_ckpt_on_only_HRT_dataset

Git clone the repo and then switch to the temp_en branch. All other instructions and the data file are the same as in experiment 2.10 above.

2.12 Combined instructions for running multi-stage and multi-task pretraining experiments.

Please switch to the multi_stage_task_pretraining branch for both experiments.

Link to the data (same for both experiments): https://drive.google.com/file/d/1nokKQhGPe_3CvLGYH87pnyWvTD31ppyZ/view?usp=sharing

Unzip data.zip and place the data directory in role_specific_finetuner/. It has 4 subdirectories: stage_1, stage_2, stage_1_and_2, and stage_3.
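
A minimal sketch of the data setup; the gdown ID comes from the link above:

gdown 1nokKQhGPe_3CvLGYH87pnyWvTD31ppyZ
unzip data.zip
# expected layout: data/stage_1, data/stage_2, data/stage_1_and_2, data/stage_3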

Please use the files run_multi_stage.sh and run_multi_task.sh to start the two experiments. Also, check the data paths, --log_dir, and --checkpoint_path values in them carefully.

2.13 Finetuning vanilla mT5 on the xalign role-ordered dataset (HRT only)
Git clone the repo and stay on the main branch.

Before running this experiment, make sure you have a standard, recent installation of the transformers library rather than the custom edited version. To do so, uninstall the existing transformers library and reinstall it with pip install transformers.

role_ordered_xalign_HRT_only dataset: https://drive.google.com/file/d/1_8B6qB_nvzCkRDW6hxfaSdKEJXE2Av5X/view?usp=sharing

mkdir data
cd data

gdown 1_8B6qB_nvzCkRDW6hxfaSdKEJXE2Av5X
unzip role_ordered_xalign_HRT_only.zip
cd ..

Run run_finetune.sh with the proper arguments for batch size and GPUs.

2.14 Finetuning role_specific mT5 on the xalign role-ordered dataset (HRT only)
Git clone the repo and switch to the original_role_ids branch.

Before running this experiment, make sure you have the custom edited installation of transformers rather than the standard version. To do so, first make sure you are on the original_role_ids branch, then uninstall the existing transformers library and install it again from the bundled transformers/ directory: cd transformers; pip install .

Set up the data similarly to experiment 2.13 above.
Run run_finetune.sh with the proper arguments for batch size and GPUs.

2.15 Bilingual model per language: finetuning vanilla mT5 on the xalign role-ordered dataset (HRT only)
Git pull the latest and check out the bilingual branch.

Before running this experiment, make sure you have a standard, recent installation of the transformers library rather than the custom edited version. To do so, uninstall the existing transformers library and reinstall it with pip install transformers.

Set up the data similarly to experiment 2.13 above.

Run run_all_lang.sh with the proper arguments for batch size and GPUs, and with sanity_run=no.

2.16 Bilingual model per language: finetuning vanilla mT5 on the original (not ordered) xalign dataset (HRT only) (for Tushar da)
Git pull the latest and check out the bilingual branch.

Before running this experiment, make sure you have a standard, recent installation of the transformers library rather than the custom edited version. To do so, uninstall the existing transformers library and reinstall it with pip install transformers.

xalign_HRT_only dataset: https://drive.google.com/file/d/1ancr8wa8gIWKrmOybi2UahAteaqFst_C/view?usp=sharing

mkdir data
cd data

gdown 1ancr8wa8gIWKrmOybi2UahAteaqFst_C
unzip xalign_HRT_only.zip
cd ..

Run run_all_lang.sh with the proper arguments for batch size and GPUs, and with sanity_run=no. Make sure to insert the right data path arguments in this script (xalign_HRT_only/train.csv, and similarly for val and test), as in the sketch below.
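
A minimal sketch of the launch; the val/test filenames are assumed by analogy with train.csv, and the exact way the data paths and sanity_run are passed depends on run_all_lang.sh:

# inside run_all_lang.sh, point the data arguments at
#   data/xalign_HRT_only/train.csv   (and the corresponding val/test files)
# set batch size, GPUs, and sanity_run=no, then launch:
bash run_all_lang.sh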

2.17 Large model training for the best model
Git pull the latest and check out the SPO_role_ids branch.

Before running this experiment, make sure you have the custom edited installation of transformers rather than the standard version. To do so, first make sure you are on the SPO_role_ids branch, then uninstall the existing transformers library and install it again from the bundled transformers/ directory: cd transformers; pip install .

Download the xalign_unified_script dataset with gdown: https://drive.google.com/file/d/1xuS8zvq4k2F6Uxfx6XF6R3ueKNNFodjl/view?usp=sharing
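
For example, using the file ID from the link above (the same archive as in experiment 2.1):

mkdir data
cd data
gdown 1xuS8zvq4k2F6Uxfx6XF6R3ueKNNFodjl
unzip xalign_unified_script.zip
cd ..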

Run run_finetune.sh with the proper arguments for batch size and GPUs, and with sanity_run=no. Just confirm once that the tokenizer and model name are set to google/mt5-base.

2.18 data_scaling experiment

Git pull the latest and check out the data_scaling_original_role_ids branch.

Before running this experiment, make sure you have the custom edited installation of transformers rather than the standard version. To do so, first make sure you are on the data_scaling_original_role_ids branch, then uninstall the existing transformers library and install it again from the bundled transformers/ directory: cd transformers; pip install .

Download the xalign_unified_script dataset with gdown: https://drive.google.com/file/d/1xuS8zvq4k2F6Uxfx6XF6R3ueKNNFodjl/view?usp=sharing
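
As in experiment 2.17, a minimal sketch using the file ID from the link above:

mkdir data
cd data
gdown 1xuS8zvq4k2F6Uxfx6XF6R3ueKNNFodjl
unzip xalign_unified_script.zip
cd ..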

Run run_finetune.sh with the proper arguments for batch size and GPUs, and with sanity_run=no. Make sure to pass the data_scaling_factor appropriately.

  3. Some miscellaneous tips before running experiments
  • Set up the wandb API key using export WANDB_API_KEY=bf6eddaca0cddb4d9e70aa37fb5ef56202d7ef74
  • I have set up a new wandb account to get more cloud storage. Please use this key starting August 9, 2022: export WANDB_API_KEY=bcb8767e750b1d80ba93361478ba51b615f2b8ce
  • Make sure to sanity-run the scripts below first by passing sanity_run=yes in the respective scripts. Later, pass sanity_run=no for the actual training runs.
  • Pass appropriate arguments for batch size, GPUs, epochs, etc. in the scripts below.
  • conda activate xalign_role can sometimes throw an error from the shell. Refer to the first few commented lines of the scripts to resolve the issue quickly if it occurs.

Steps to run the pretraining experiment

  • run bash run_pretrain.sh

Steps to run the finetuning experiment

  • set the checkpoint_path variable in run_finetune.sh to the path of the checkpoint created by the pretraining experiment above.
  • run bash run_finetune.sh
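
For example (the checkpoint path below is a placeholder; use the path actually written by your pretraining run):

bash run_pretrain.sh
# then, inside run_finetune.sh, set e.g.
#   checkpoint_path=logs/pretrain/checkpoints/last.ckpt   (placeholder)
bash run_finetune.sh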
