In this study, we extend our previous work *Studying the usage of text-to-text transfer transformer for code-related tasks*, paying particular attention to the role played by pre-training and multi-task fine-tuning in the model's performance.
In order to pre-train and then fine-tune a T5 small model, we need a new SentencePiece model to accommodate the expanded vocabulary given by the Java programming language, abstracted Java tokens, and technical natural language.
-
How to train a new SentencePiece model
Pythonic way
```
pip install sentencepiece==0.1.96
```

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    '--input=pretraining.txt --model_prefix=dl4se --vocab_size=32000 '
    '--bos_id=-1 --eos_id=1 --unk_id=2 --pad_id=0'
)
```
The new SentencePiece model has to be trained on the entire pre-training corpus. The tokenizer we trained is available here.
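As a quick sanity check, the trained tokenizer can be loaded and applied to a code snippet. This is a minimal sketch, assuming the training run above produced `dl4se.model` in the working directory:

```python
import sentencepiece as spm

# Load the SentencePiece model produced by the training step above
sp = spm.SentencePieceProcessor()
sp.load('dl4se.model')

# Tokenize a Java-flavored string into subword pieces and ids
print(sp.encode_as_pieces('public static void main(String[] args)'))
print(sp.encode_as_ids('public static void main(String[] args)'))
```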
-
To set up a new GCS bucket for training and fine-tuning a T5 model, please follow the original guide provided by Google: https://cloud.google.com/storage/docs/quickstart-console. Subsequently, by following the Jupyter notebook we provide for pre-training and fine-tuning the network, you should be able to set up the final environment.
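For reference, the bucket can also be created programmatically, and a Colab runtime must be authenticated before it can read from or write to the bucket. A minimal sketch, where the project id, bucket name, and region are hypothetical placeholders:

```python
# Inside the Colab notebook (names below are placeholders, not our actual setup)
from google.colab import auth
from google.cloud import storage

# Authenticate the Colab runtime so it can access GCS
auth.authenticate_user()

# Create the bucket once; the region should match the TPU region used for training
client = storage.Client(project='your-project-id')
bucket = client.create_bucket('your-bucket-name', location='us-central1')
print(f'Created bucket {bucket.name}')
```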
-
The datasets for the pre-training and the fine-tuning can be found here.
-
To pre-train and then fine-tune T5, please use the script we provide here:
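Our script handles this end to end; for orientation only, the core of a TPU run with the `t5` library looks roughly like the sketch below. The bucket path, TPU address, task/mixture names, and step counts are placeholder assumptions, not our exact configuration:

```python
import t5

MODEL_DIR = 'gs://your-bucket-name/models/small'   # placeholder bucket path
TPU_ADDRESS = 'grpc://10.0.0.2:8470'               # placeholder; taken from the TPU runtime

model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu=TPU_ADDRESS,
    tpu_topology='2x2',
    model_parallelism=1,
    batch_size=128,
    sequence_length={'inputs': 512, 'targets': 512},
    learning_rate_schedule=0.001,
    save_checkpoints_steps=5000,
    keep_checkpoint_max=None,
    iterations_per_loop=100,
)

# Task/mixture names are hypothetical; tasks must first be registered in
# t5.data.TaskRegistry, as done in the notebook we provide.
model.train(mixture_or_task_name='pretraining_task', steps=200000)
model.finetune(
    mixture_or_task_name='code_tasks_mixture',
    finetune_steps=100000,
    pretrained_model_dir=MODEL_DIR,
)
```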
-
First, you need to convert the TF model into a PyTorch model by using TF_to_Pytorch; then run Generate Results.
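The conversion follows the standard Hugging Face route; a minimal sketch, assuming the checkpoint path and output directory are placeholders and that a T5-small-sized configuration matches the trained model:

```python
from transformers import T5Config, T5ForConditionalGeneration, load_tf_weights_in_t5

# Build a T5-small-sized model and load the TensorFlow checkpoint weights into it
config = T5Config.from_pretrained('t5-small')  # adjust if your config differs
model = T5ForConditionalGeneration(config)
load_tf_weights_in_t5(model, config, '/path/to/tf_checkpoint')

# Save in PyTorch format for the generation step
model.save_pretrained('./pytorch_model')
```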
-
Predictions: Click Me!
-
Under Miscellaneous, you can find the additional scripts used for computing the statistical tests, the complementary analyses, and the overlap and data-snooping analysis.
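For instance, a paired comparison between two model configurations evaluated on the same test instances can be run along these lines. This is only an illustrative sketch using a Wilcoxon signed-rank test via `scipy` on hypothetical per-example correctness scores; the actual tests are in the Miscellaneous scripts:

```python
from scipy.stats import wilcoxon

# Hypothetical per-example scores (1 = perfect prediction, 0 = otherwise)
# for two configurations evaluated on the same test instances
scores_pretrained = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
scores_no_pretraining = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]

# Paired, non-parametric test of whether the two configurations differ
stat, p_value = wilcoxon(scores_pretrained, scores_no_pretraining)
print(f'statistic={stat:.3f}, p-value={p_value:.4f}')
```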