This project investigates the impact of Part-Of-Speech (POS)-aware data augmentation on the performance of the roberta-base model across four SuperGLUE tasks. In our experiments, POS-aware augmentation outperformed random augmentation and made training more stable on challenging tasks such as WiC and RTE. However, it is not a silver bullet: the augmentation requires task-specific parameter tuning to achieve the best performance, or to yield any improvement at all.
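To make the distinction between POS-aware and random augmentation concrete, here is a toy illustration. It is only a sketch and not the project's implementation: the hand-written lexicon `TOY_POS` and the functions `pos_aware_substitute` and `random_substitute` are hypothetical names introduced for this example, and a real pipeline would presumably use a proper POS tagger.

```python
import random

# Hypothetical toy lexicon (word -> POS tag); an assumption for illustration only.
TOY_POS = {
    "cat": "NOUN", "dog": "NOUN", "bank": "NOUN",
    "runs": "VERB", "sleeps": "VERB", "jumps": "VERB",
    "quick": "ADJ", "lazy": "ADJ",
}


def words_by_pos(lexicon):
    """Group the lexicon's words by their POS tag."""
    pools = {}
    for word, tag in lexicon.items():
        pools.setdefault(tag, []).append(word)
    return pools


def pos_aware_substitute(tokens, rng, prob=0.5, lexicon=TOY_POS):
    """Replace known words only with other words that share their POS tag."""
    pools = words_by_pos(lexicon)
    out = []
    for token in tokens:
        tag = lexicon.get(token)
        if tag is not None and rng.random() < prob:
            candidates = [w for w in pools[tag] if w != token]
            out.append(rng.choice(candidates) if candidates else token)
        else:
            out.append(token)
    return out


def random_substitute(tokens, rng, prob=0.5, lexicon=TOY_POS):
    """Baseline: replace any token with any lexicon word, ignoring POS."""
    vocab = list(lexicon)
    return [rng.choice(vocab) if rng.random() < prob else t for t in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = ["the", "quick", "dog", "sleeps"]
    print("pos-aware:", pos_aware_substitute(sentence, rng))
    print("random:   ", random_substitute(sentence, rng))
```

The POS-aware variant keeps the grammatical role of each replaced word, whereas the random baseline can, for example, put a verb where a noun was.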
To run the training script, you need to have Python 3.12 and the required packages installed.
pip install -r requirements.txt
Additionally, you need to fill in the .env file with your Neptune.ai NEPTUNE_PROJECT and NEPTUNE_API_TOKEN to log the experiments.
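As a quick sanity check before training, you can verify from Python that both variables are present. This is a minimal sketch that assumes a plain KEY=VALUE .env format; the helpers `parse_env` and `missing_neptune_vars` are illustrative names and not part of the project.

```python
from pathlib import Path

REQUIRED_VARS = ("NEPTUNE_PROJECT", "NEPTUNE_API_TOKEN")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_neptune_vars(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_neptune_vars(parse_env(text))
    print("missing:", missing if missing else "none")
```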
transformers - for obtaining the checkpoints, training loop and evaluation
datasets - for loading the SuperGLUE datasets
fast-aug - our custom library for random data augmentation, written in Rust with Python bindings
neptune - for logging the experiments (runs available)
To get all the available options, run:
python main.py --help
For example, to train the roberta-base model on the WiC task with POS-aware substitution augmentation, run:
python main.py --task_name super_glue/wic --model_name roberta-base --augmentation words-pos-sub
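To repeat the same run over several tasks, the command above can be generated from Python. The sketch below uses only the flags shown in the example; the task identifier super_glue/rte is an assumption based on the naming of super_glue/wic, and `build_command` is a hypothetical helper. By default it only prints the commands, so check `python main.py --help` for the actual accepted values before executing them.

```python
import shlex
import subprocess


def build_command(task_name, model_name="roberta-base", augmentation="words-pos-sub"):
    """Build the argument list for one training run of main.py."""
    return [
        "python", "main.py",
        "--task_name", task_name,
        "--model_name", model_name,
        "--augmentation", augmentation,
    ]


def run_all(task_names, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    commands = [build_command(task) for task in task_names]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # "super_glue/rte" is an assumed identifier, not confirmed by this README.
    run_all(["super_glue/wic", "super_glue/rte"], dry_run=True)
```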