nougatca / finetuner

License: GNU General Public License v3.0

Python 99.83% Shell 0.17%

finetuner's Introduction

FineTuner

Requirements

The evaluation scripts in src/evaluation and the src/main.py script, which runs fine-tuning and/or evaluation, require the Python packages listed in requirements.txt. Install them with pip install -r requirements.txt.

wandb

The src/main.py script requires your wandb API key. Sign up for an account at https://wandb.ai and copy your API key to wandb_api.key in the root directory. Alternatively, export WANDB_API_KEY=___USER_KEY__ before running the src/main.py script.
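
The two ways of supplying the key can be sketched in Python as follows. This is a minimal illustration, not the code used by src/main.py: the function name load_wandb_key and the rule that an exported variable takes precedence over the file are assumptions made for this example.

```python
import os
from pathlib import Path


def load_wandb_key(key_file="wandb_api.key", env=None):
    """Return the wandb API key, preferring an already-exported WANDB_API_KEY.

    Hypothetical helper: the precedence rule is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    # An exported variable wins, matching the "export WANDB_API_KEY=..." route.
    if env.get("WANDB_API_KEY"):
        return env["WANDB_API_KEY"]
    path = Path(key_file)
    if not path.is_file():
        raise FileNotFoundError(f"WANDB_API_KEY is not set and {key_file} was not found")
    key = path.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{key_file} is empty")
    # wandb reads WANDB_API_KEY from the environment, so expose the key there.
    env["WANDB_API_KEY"] = key
    return key
```

Calling load_wandb_key() from the root directory before wandb is initialized would cover both cases described above.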

Datasets

All datasets can be downloaded here: OneDrive, Zenodo. Extract the archive and put the entire folder in the root directory. Alternatively, put it anywhere else and specify the path using the --data_dir argument.

Pre-Trained Models and Tokenizer

All pre-trained models and tokenizer can be downloaded here: OneDrive, Zenodo

Evaluation Scripts

Easy-to-use evaluation scripts are provided in src/evaluation, with detailed comments for reference.

Statistical Significance of the results reported in the paper

OneDrive

Runs

Run main.py to start fine-tuning and/or evaluation. All arguments are defined in args.py; specify whichever ones you need.

Some example commands follow.

# run defect detection task using roberta, with default hyperparameters
python main.py --model roberta --task defect

# run java summarization using codet5
python main.py --model codet5 --task summarization --subset java

# only run evaluation using specific model directory
python main.py --model PATH_TO_MODEL --task clone --only_test

# run code generation using plbart and specify some common arguments
# all gpu devices are used by default; specify device ids with --cuda_visible_devices,
# or add --no_cuda to disable gpu and use cpu instead
python main.py \
--model plbart \
--task generation \
--num_epochs 10 \
--train_batch_size 64 \
--eval_batch_size 32 \
--max_source_length 64 \
--max_target_length 256 \
--learning_rate 1e-5 \
--num_warmup_steps 1000 \
--cuda_visible_devices 2,3 \
--mixed_precision no # no, fp16, bf16

finetuner's Issues

Failed data and results links

Hello, Changan,

It's really an interesting work.

But it seems that the data and results links are invalid. Can you check and provide them again?

Looking forward to your reply. Thanks in advance!

Some errors that occurred while fine-tuning models such as codebert

Hello, Professor!
When I fine-tune codebert, graphcodebert and unixcoder on the downstream tasks, they all fail with the same error:

==================== LOADING ====================
Loaded config 'RobertaConfig' from '/data/pretrain-model/unixcoder-base'
Loaded tokenizer 'RobertaTokenizerFast' from '/data/pretrain-model/unixcoder-base', size: 51416
Loaded unwrapped model 'RobertaModel' from '/data/pretrain-model/unixcoder-base'
Trainable parameters: 126M
Traceback (most recent call last):
  File "/data/FineTuner/src/main.py", line 113, in <module>
    main()
  File "/data/FineTuner/src/main.py", line 109, in main
    run_fine_tune(args, accelerator, run)
  File "/data/FineTuner/src/run_fine_tune.py", line 270, in run_fine_tune
    model, tokenizer = build_model_tokenizer(args)
  File "/data/FineTuner/src/models.py", line 320, in build_model_tokenizer
    model = EncoderDecoderModel(config, encoder=model)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py", line 195, in __init__
    raise ValueError(f"Config: {config} has to be of type {self.config_class}")
ValueError: Config: RobertaConfig {...} has to be of type <class 'transformers.models.encoder_decoder.configuration_encoder_decoder.EncoderDecoderConfig'>

Question about the SynCoBERT code

Hello Professor Niu,
I would like to ask for help regarding the following passages:
"(4) If no source code is provided, we re-implement and pre-train according to the settings (e.g., tokenizer, hyperparameters, and dataset) described in the original papers. They are GPT-C, C-BERT, DeepDebug and SynCoBERT⁴."

"⁴ To verify the validity of the latter two types of models pre-trained by us, we perform fine-tuning on the downstream tasks corresponding to the original paper and use pair-wise t-tests to ensure that the difference between our results and those reported in the original papers are statistically indistinguishable. Details can be found in the supplementary materials."

Models / Tokenizers

Hi @NougatCA,

I'm trying to understand where you got the models and tokenizers. Here is a breakdown of all models evaluated in the empirical study and listed in the pre-print.

Model	HuggingFace	Repository
PTM-NL
RoBERTa [7]	roberta-base	facebookresearch/fairseq
GPT-2 [9]	gpt2	openai/gpt-2
BART [11]	facebook/bart-base	facebookresearch/bart
T5 [10]	t5-base	google-research/text-to-text-transfer-transformer

PTM-C
CuBERT [12]	(available elsewhere)	google-research/cubert
GPT-C [14]	(available elsewhere)	---
C-BERT [13]	(available elsewhere)	---
JavaBERT [15]	CAUKiel/JavaBERT	cau-se/gh-archive-code-retrieval
CodeGPT-adapted [35]	microsoft/CodeGPT-small-java-adaptedGPT2	microsoft/CodeXGLUE
DeepDebug [16]	niuca/DeepDebug (not official)	---

CodePTM
CodeBERT [53]	microsoft/codebert-base	microsoft/CodeBERT
GraphCodeBERT [54]	microsoft/graphcodebert-base	microsoft/GraphCodeBERT
CugLM [40]	(available elsewhere)	LiuFang816/CugLM
DOBF [55]	(available elsewhere)	facebookresearch/CodeGen
T5-learning [56]	niuca/T5-learning (not official) (or available elsewhere)	antonio-mastropaolo/T5-learning-ICSE_2021
PLBART [57]	uclanlp/plbart-base	wasiahmad/PLBART
ProphetNet-Code [58]	(available elsewhere)	microsoft/ProphetNet_Code
CoTexT [59]	razent/cotext-2-cc	justinphan3110/CoTexT
TreeBERT [28]	(available elsewhere)	17385/TreeBERT
CodeT5 [62]	Salesforce/codet5-base	salesforce/CodeT5
SynCoBERT [63]	???	---
SPT-Code [29]	(available elsewhere)	NougatCA/SPT-Code
UniXcoder [64]	microsoft/unixcoder-base	microsoft/UniXcoder

(Note: SCELMo [52], OSCAR [60], and CodeDisen [61] have been excluded for several reasons. See the pre-print for more details.)

Questions/Comments regarding the table above:

Regarding the GPT-2 [9] model, why did you use the distilgpt2 model instead of gpt2?
According to the pre-print, neither a pre-trained model nor source code was available for GPT-C [14], C-BERT [13], and DeepDebug [16]. Thus, you re-implemented and pre-trained all of them according to the settings (e.g., tokenizer, hyperparameters, and dataset) described in the original papers. You kindly provided them here, thanks for that.
According to the pre-print, neither a pre-trained model nor source code was available for SynCoBERT [63], and therefore you re-implemented and pre-trained it as described in the original paper. Did you by any chance forget to include SynCoBERT in the zip file you kindly provided here?

--
Best,
Jose

Large Dataset Multiprocessing Issue

When I try to use a new, large dataset (15M pairs) to test code clone detection on different models, I get an error related to multiprocessing during encoding. Any ideas or suggestions? I suspect the dataset is too large and the CPU stalls while processing it. I tried reducing the batch size and max_length, but the problem persists. The system I'm using is Linux.

Error Message:

Killed
[usr]$ Process ForkPoolWorker-3:
Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 131, in worker
put((job, i, result))
File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 409, in _send_bytes
self._send(buf)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 136, in worker
put((job, i, (False, wrapped)))
File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
self._send(header)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-2:
Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 131, in worker
put((job, i, result))
File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
self._send(header)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 136, in worker
put((job, i, (False, wrapped)))
File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
self._send(header)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-1:
Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 131, in worker
put((job, i, result))
File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
self._send(header)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 136, in worker
put((job, i, (False, wrapped)))
File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
self._send(header)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-4:
Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 131, in worker
put((job, i, result))
File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 204, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 408, in _send_bytes
self._send(header)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 372, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:
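
A leading "Killed" line usually means the process received SIGKILL, which on Linux is often the out-of-memory killer. The BrokenPipeError messages would then be the pool workers failing to send results to a parent process that no longer exists. Assuming that is what happened here (the thread does not confirm it), one mitigation is to encode the dataset in bounded chunks instead of all 15M pairs at once. The sketch below is not taken from the repository; iter_chunks, encode_in_chunks and the encode_chunk callback are hypothetical names.

```python
from itertools import islice


def iter_chunks(examples, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(examples)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def encode_in_chunks(examples, encode_chunk, chunk_size=100_000):
    """Encode lazily: hand one bounded chunk at a time to encode_chunk.

    encode_chunk is a hypothetical callback that takes a list of raw
    examples and returns a list of encoded features.
    """
    for chunk in iter_chunks(examples, chunk_size):
        # Only one chunk of raw examples is held in memory at a time.
        yield from encode_chunk(chunk)
```

The encode_chunk callback could wrap a multiprocessing pool's map call, so that each result sent back through the worker pipes stays small. Writing each encoded chunk to disk before moving to the next one would bound memory further.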

_pickle.PicklingError: Can't pickle <function AcceleratedOptimizer.step>

$ python main.py --model codet5 --task assert --subset raw --train_batch_size 16 --eval_batch_size 16

results in the following error:

==================== INITIALIZING ====================
Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16

...

==================== LOADING ====================
Loaded config 'T5Config' from 'Salesforce/codet5-base'
Loaded tokenizer 'RobertaTokenizerFast' from 'Salesforce/codet5-base', size: 32100
Loaded unwrapped model 'T5ForConditionalGeneration' from 'Salesforce/codet5-base'
Loaded model 'T5ForConditionalGeneration' from 'Salesforce/codet5-base'
Trainable parameters: 223M
Start loading train data from ../datasets/assert/assert/raw
Loading train data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150523/150523 [00:00<00:00, 158473.45it/s]
train data loaded, total size: 150523
Start encoding train data into input features
Encoding: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150523/150523 [00:16<00:00, 8902.78it/s]
train data encoded, total size: 150523
Start loading valid data from ../datasets/assert/assert/raw
Loading valid data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18816/18816 [00:00<00:00, 339752.22it/s]
valid data loaded, total size: 18816
Start encoding valid data into input features
Encoding: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18816/18816 [00:08<00:00, 2348.51it/s]
valid data encoded, total size: 18816
Data is loaded and prepared

...

==================== TRAINING ====================

...

The best em model is saved to ../outputs/codet5_assert_assert_raw_bs16_ep30_lr5e-05_warmup1000_20230305_151212/models/best_em
Traceback (most recent call last):
  File "/tmp/FineTuner/src/main.py", line 113, in <module>
    main()
  File "/tmp/FineTuner/src/main.py", line 109, in main
    run_fine_tune(args, accelerator, run)
  File "/tmp/FineTuner/src/run_fine_tune.py", line 416, in run_fine_tune
    torch.save(optimizer, os.path.join(save_last_dir, "optimizer.pt"))
  File "/tmp/FineTuner/fine-tuner/lib/python3.9/site-packages/torch/serialization.py", line 423, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/tmp/FineTuner/fine-tuner/lib/python3.9/site-packages/torch/serialization.py", line 635, in _save
    pickler.dump(obj)
_pickle.PicklingError: Can't pickle <function AcceleratedOptimizer.step at 0x7ff18e0d3670>: it's not the same object as accelerate.optimizer.AcceleratedOptimizer.step

PS: This issue assumes #5 has been accepted and merged.
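
The traceback shows torch.save(optimizer, ...) pickling the whole optimizer object, which at that point is Accelerate's AcceleratedOptimizer wrapper. A common workaround, sketched here and not taken from the repository, is to save optimizer.state_dict() instead, so that only plain tensors and Python containers are serialized. The names save_optimizer_state and save_fn are hypothetical; save_fn exists only so the sketch can be exercised without PyTorch installed.

```python
import os


def save_optimizer_state(optimizer, save_dir, save_fn=None):
    """Save optimizer.state_dict() rather than the optimizer object itself."""
    if save_fn is None:
        # Imported lazily so the helper can be tested with a stand-in save_fn.
        import torch
        save_fn = torch.save
    path = os.path.join(save_dir, "optimizer.pt")
    # The state dict holds no bound methods, so the wrapper's patched
    # step function is never pickled.
    save_fn(optimizer.state_dict(), path)
    return path
```

To resume, a freshly built optimizer would be restored with optimizer.load_state_dict(torch.load(path)).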

[Bug] the format with reading concode data for code generation is wrong

Thanks for the public code and the wonderful work!
When I ran the code generation task, I found the following problem:

So I looked for where the data is split, and found this code in data.py:

for idx, line in enumerate(tqdm(lines, total=len(lines), desc=f"Loading {split} data")):
        js = json.loads(line.strip())
        if args.task == "summarization":
            source = " ".join(js["code_tokens"])
            target = " ".join(js["docstring_tokens"])
        else:
            source = " ".join(js["nl"])
            target = " ".join(js["code"])

The separator in the else branch shouldn't be " ". For example, the result should be "int", but I only get "i n t".
I think the following would fix it:

for idx, line in enumerate(tqdm(lines, total=len(lines), desc=f"Loading {split} data")):
        js = json.loads(line.strip())
        if args.task == "summarization":
            source = " ".join(js["code_tokens"])
            target = " ".join(js["docstring_tokens"])
        else:
            source = "".join(js["nl"])
            target = "".join(js["code"])
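
The behavior reported above is what str.join does when its argument is a string rather than a list: it iterates over the individual characters. The short sketch below reproduces this and shows one way to handle both shapes. It assumes, as the report implies but does not show, that js["nl"] and js["code"] are plain strings in the concode data; as_text is a hypothetical helper name.

```python
def as_text(field):
    """Return a string for a field that may be a str or a list of tokens."""
    if isinstance(field, str):
        # Already text: joining it would split it into characters.
        return field
    # Token list: join with single spaces, as in the summarization branch.
    return " ".join(field)


# " ".join iterates over a string character by character:
print(" ".join("int"))        # i n t
print(as_text("int"))         # int
print(as_text(["int", "x"]))  # int x
```

If the field is already a string, "".join(field) returns it unchanged, so the proposed fix works, though using the string directly states the intent more plainly.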

1 GPU when multiple are available

As discussed here, it seems that the main.py script does not run on multiple GPUs even when several are available.

When I run

$ python main.py --model codet5 --task assert --subset raw

I see the following output

==================== INITIALIZING ====================
Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16

...

which reports

Distributed environment: NO
Num processes: 1

I was expecting to see

Distributed environment: YES
Num processes: 2

My system

$ nvidia-smi -L
GPU 0: NVIDIA Tesla T4 (UUID: GPU-e239a762-57ba-9dd6-902f-86c53bb8862f)
GPU 1: NVIDIA Tesla T4 (UUID: GPU-fcbaf79c-8afc-fc83-6adf-1245770db6b2)

import torch
use_cuda = torch.cuda.is_available()

if use_cuda:
  print('__CUDNN VERSION:', torch.backends.cudnn.version())
  print('__Number CUDA Devices:', torch.cuda.device_count())
  print('__CUDA Device Name:',torch.cuda.get_device_name(0))
  print('__CUDA Device Total Memory [GB]:',torch.cuda.get_device_properties(0).total_memory/1e9)

  from tensorflow.python.client import device_lib
  local_device_protos = device_lib.list_local_devices()
  [x.name for x in local_device_protos if x.device_type == 'GPU']

__CUDNN VERSION: 8500
__Number CUDA Devices: 2
__CUDA Device Name: NVIDIA Tesla T4
__CUDA Device Total Memory [GB]: 15.843721216
['/device:GPU:0', '/device:GPU:1']
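
One possible explanation, offered as an assumption and not a confirmed diagnosis: started with plain python main.py, the script is a single process, and a distributed environment is normally reported only when a launcher such as torchrun or accelerate launch starts one process per GPU and exports variables like WORLD_SIZE and LOCAL_RANK. Whether main.py supports being started that way is not stated in this thread. The sketch below only inspects those variables; describe_launch is a hypothetical helper.

```python
import os


def describe_launch(env=None):
    """Summarize the distributed-launch variables visible to this process."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    local_rank = int(env.get("LOCAL_RANK", "-1"))
    return {
        "world_size": world_size,
        "local_rank": local_rank,
        # Launchers such as torchrun export these; "python main.py" does not.
        "launched_distributed": local_rank != -1 and world_size > 1,
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
    }


if __name__ == "__main__":
    print(describe_launch())
```

If launched_distributed is False, both GPUs can still be visible to PyTorch, as in the output above, while only one process is running.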

Missing datasets/assert directory

Hi @NougatCA,

The datasets.zip file available here is missing the assert directory.

$ python main.py --model codet5 --task assert --subset raw
==================== INITIALIZING ====================
Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16

wandb: Currently logged in as: josecampos (uporto). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.10
wandb: Run data is saved locally in /tmp/FineTuner/src/wandb/run-20230301_121613-rg039n1c
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run codet5_assert_assert_raw
wandb: ⭐️ View project at https://wandb.ai/uporto/CodePTM_Evaluation
wandb: 🚀 View run at https://wandb.ai/uporto/CodePTM_Evaluation/runs/rg039n1c
==================== LOADING ====================
Loaded config 'T5Config' from 'Salesforce/codet5-base'
Loaded tokenizer 'RobertaTokenizerFast' from 'Salesforce/codet5-base', size: 32100
Loaded unwrapped model 'T5ForConditionalGeneration' from 'Salesforce/codet5-base'
Loaded model 'T5ForConditionalGeneration' from 'Salesforce/codet5-base'
Trainable parameters: 223M
Start loading train data from ../datasets/assert/assert/raw
Traceback (most recent call last):
  File "/tmp/FineTuner/src/main.py", line 113, in <module>
    main()
  File "/tmp/FineTuner/src/main.py", line 109, in main
    run_fine_tune(args, accelerator, run)
  File "/tmp/FineTuner/src/run_fine_tune.py", line 277, in run_fine_tune
    train_examples, train_dataset, train_dataloader = prepare_data(args, split="train", tokenizer=tokenizer)
  File "/tmp/FineTuner/src/data.py", line 504, in prepare_data
    examples = load_examples(args, split=split, aux_data=aux_data)
  File "/tmp/FineTuner/src/data.py", line 204, in load_examples
    with open(source_path, mode="r", encoding="utf-8") as source_f, \
FileNotFoundError: [Errno 2] No such file or directory: '../datasets/assert/assert/raw/train_methods.txt'
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 🚀 View run codet5_assert_assert_raw at: https://wandb.ai/uporto/CodePTM_Evaluation/runs/rg039n1c
wandb: Synced 5 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230301_121613-rg039n1c/logs

$ ls /tmp/FineTuner/datasets/
README.md  bug  clone  defect  generation  qa  retrieval  search  summarization  translation

Could you please update the zip file to also include the assert directory? Thanks in advance.

--
Best,
Jose
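
A failure like the one above surfaces only after the model has been loaded and a wandb run has been started. A small pre-flight check, sketched here under assumptions, could report missing files earlier. The directory layout root/task/dataset/subset is inferred from the path in the traceback, train_methods.txt is the only file name the log confirms, and missing_files is a hypothetical helper.

```python
from pathlib import Path


def missing_files(root, task, dataset, subset, filenames):
    """Return the expected files that do not exist under root/task/dataset/subset."""
    # Layout inferred from the traceback: ../datasets/assert/assert/raw/...
    base = Path(root) / task / dataset / subset
    return [str(base / name) for name in filenames if not (base / name).is_file()]


if __name__ == "__main__":
    absent = missing_files("../datasets", "assert", "assert", "raw", ["train_methods.txt"])
    if absent:
        print("Missing dataset files:")
        for path in absent:
            print("  " + path)
```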

Assert generation quick example

Hi @NougatCA,

Thanks for sharing this repository and congratulations on your ICSE'23 paper.

Any chance you could provide a quick example of how to generate asserts for an assertless test case using the top-5 performers, i.e., PLBART, TreeBERT, SPT-Code, T5-learning, and CodeT5? I'm not interested in training, fine-tuning, or evaluating those models as you did in the paper. I would just like to use them from a developer / end-user point of view. Any source code showing how to instantiate and use those models would be much appreciated.

Thanks in advance.

--
Best,
Jose

How would one use a model that is not available on huggingface?

Hi @NougatCA,

Some of the models evaluated in the empirical study are not available on huggingface (CuBERT [12], GPT-C [14], C-BERT [13], CugLM [40], DOBF [55], ProphetNet-Code [58], TreeBERT [28], SynCoBERT [63], and SPT-Code [29]), but you kindly provided an offline version of those here. Thanks for that. (Note: SynCoBERT [63] is currently missing from the zip file, see #9.)

The question is: how would one invoke the main.py script on a model that is not available on huggingface?

In #1 you said,

... main.py uses the .from_pretrained(model) functions provided by HuggingFace transformer package. In addition to providing a model id similar to Salesforce/codet5-base, etc., you can also provide the model's local path such as /path/to/local/dir/ where the checkpoint exists.

but that does not work.

$ ls -lh /tmp/FineTuner/models/treebert
-rw-r--r-- 1 foo Domain Users 851M Mar  1 23:03 pytorch_model.bin

$ python main.py --model /tmp/FineTuner/models/treebert

main.py: error: argument --model: invalid choice: '/tmp/FineTuner/models/treebert' (choose from 'transformer', 'bert', 'roberta', 'gpt2', 'bart', 't5', 'codebert', 'graphcodebert', 'javabert', 'codegpt', 'codegpt-adapted', 'deepdebug', 't5-learning', 'plbart', 'cotext', 'codet5', 'unixcoder')

Any idea on how to adapt the main.py script to support offline models?

--
Best,
Jose
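
The "invalid choice" message comes from argparse: --model is restricted to a fixed list of names, so a path is rejected before any loading code runs. One way to relax this, sketched here and not taken from args.py, is to replace the choices restriction with a type function that also accepts an existing directory. The names model_name_or_path and build_parser are hypothetical; the list of known models is copied from the error message above.

```python
import argparse
import os

# Names copied from the "invalid choice" error message.
KNOWN_MODELS = [
    "transformer", "bert", "roberta", "gpt2", "bart", "t5", "codebert",
    "graphcodebert", "javabert", "codegpt", "codegpt-adapted", "deepdebug",
    "t5-learning", "plbart", "cotext", "codet5", "unixcoder",
]


def model_name_or_path(value):
    """argparse type: accept a known model name or an existing local directory."""
    if value in KNOWN_MODELS or os.path.isdir(value):
        return value
    raise argparse.ArgumentTypeError(
        f"{value!r} is neither a known model name nor an existing directory"
    )


def build_parser():
    parser = argparse.ArgumentParser()
    # type= replaces choices= so local checkpoint directories pass validation.
    parser.add_argument("--model", type=model_name_or_path, required=True)
    return parser
```

Passing validation is only the first step: the rest of main.py would still need to know which model class and tokenizer to use for the directory. Note also that from_pretrained on a local directory generally expects a config.json next to the weights, while the listing above shows only pytorch_model.bin.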
