This repository contains the dataset and source code for "Review4Repair: Code Review Aided Automatic Program Repairing".
Index | Model Name | Link |
---|---|---|
1 | best_model_cc_hard | link |
2 | best_model_c_hard | link |
3 | best_model_cc_soft | link |
4 | best_model_c_soft | link |
5 | best_model_c_10k_vocab | link |
6 | best_model_c_20k_vocab | link |
7 | best_pretrained_model_c_hard | link |
To run inference and evaluation, unzip the checkpoint you want to use. Sample tokenized test sets are available in the "Inference" directory.
Evaluation: python Inference\evaluation.py <test_set_src> <test_set_tgt> <model_name>
Predict: python Inference\prediction.py <test_set_src> <model_name>
For example, to evaluate model_cc on the test set:
python Inference\evaluate.py Inference\CC_sample_src-test.txt Inference\CC_sample_tgt-test.txt best_model_cc_hard.pt
To predict using model_cc on the test set:
python Inference\predict.py Inference\CC_sample_src-test.txt best_model_cc_hard.pt
For details on model training, refer to the OpenNMT documentation. To preprocess the training data:
!onmt_preprocess -train_src training_data/<SRC_DATA>.txt \
-train_tgt training_data/c/<TGT_DATA>.txt \
-save_data <SAVE_DIR> \
-src_vocab vocab/<SRC_VOCAB_FILE>.txt \
-tgt_vocab vocab/<TGT_VOCAB_FILE>.txt \
-src_vocab_size <SRC_VOCAB_SIZE> \
-tgt_vocab_size <TGT_VOCAB_SIZE> \
-src_seq_length <SRC_LEN> \
-src_seq_length_trunc <SRC_LEN> \
-tgt_seq_length 100 \
-tgt_seq_length_trunc 100 \
-dynamic_dict \
-overwrite
<SRC_LEN> = 600 for model_cc, 400 for model_c
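The flag list above can also be assembled programmatically, which keeps the model-specific `<SRC_LEN>` in one place. The following is only a sketch: the helper name `build_preprocess_args`, the file paths, and the vocabulary sizes in the usage call are placeholders chosen for illustration, not values taken from this repository.

```python
import shlex

# Source sequence lengths stated above: 600 for model_cc, 400 for model_c.
SRC_LEN = {"model_cc": 600, "model_c": 400}


def build_preprocess_args(model, train_src, train_tgt, save_dir,
                          src_vocab, tgt_vocab, src_vocab_size, tgt_vocab_size):
    """Return the onmt_preprocess argument list for one model variant."""
    src_len = str(SRC_LEN[model])
    return [
        "onmt_preprocess",
        "-train_src", train_src,
        "-train_tgt", train_tgt,
        "-save_data", save_dir,
        "-src_vocab", src_vocab,
        "-tgt_vocab", tgt_vocab,
        "-src_vocab_size", str(src_vocab_size),
        "-tgt_vocab_size", str(tgt_vocab_size),
        "-src_seq_length", src_len,
        "-src_seq_length_trunc", src_len,
        "-tgt_seq_length", "100",
        "-tgt_seq_length_trunc", "100",
        "-dynamic_dict",
        "-overwrite",
    ]


# All paths and sizes below are hypothetical placeholders.
args = build_preprocess_args(
    "model_cc",
    train_src="training_data/src.txt",
    train_tgt="training_data/tgt.txt",
    save_dir="preprocessed/cc",
    src_vocab="vocab/src_vocab.txt",
    tgt_vocab="vocab/tgt_vocab.txt",
    src_vocab_size=10000,
    tgt_vocab_size=10000,
)
print(shlex.join(args))
```

The printed string can be pasted into a shell, or `args` can be passed to `subprocess.run` if OpenNMT is installed.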
Our database includes a total of 35 tables, providing information about code changes, reviewer details, review times, reviews and their corresponding responses, etc. We highlight some major tables from our database below. The full ER diagram is attached here.
To reproduce a particular database <DB_NAME>, run the following commands.
cd <PROJECT_NAME>
mysql -u root -p <DB_NAME> < <PROJECT_NAME>.sql
The dataset is organized as follows.
├───acumos
│   ├───acumos.sql
│   ├───gerrit_db_acumos.json
│   └───Downloaded_Codes_acumos.zip
│       ├───acumos_1
│       │   ├───1.java
│       │   ├───2.java
│       │   └───....
│       ├───....
│       └───acumos_MapJSON.json
├───android
├───....
├───unicorn
└───unified_with_date.json
We provide the database of each project separately for modularity. Each project folder contains the database and the downloaded code, and the downloaded code is provided as a zipped file. Under each folder in the zipped file, there are multiple versions of the same file at different commit times. Each zipped folder also contains a <PROJECT_NAME>_MapJSON file that maps each folder name to the corresponding one in the database and JSON file. Each project folder contains a JSON file named gerrit_db_<PROJECT_NAME>, which contains the code reviews along with the Java code file name before and after the change. A sample entry is given below:
{
"comment_id" : "3a045101_dd7d55b3",
"message" : "GROUP_INTERFACE string given is wrong.Please modify.",
"file_name" : "android-android_api-base-src-main-java-org-iotivity-base-OcPlatform.java",
"line_number" : 47,
"base_patch_number" : 3,
"changed_patch_number" : 5,
"line_change" : 1
}
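A minimal sketch of reading such entries in Python is shown here. It assumes that a gerrit_db_<PROJECT_NAME> file holds a JSON array of objects shaped like the sample above; that layout is an assumption, so the example parses an inline copy of the sample instead of opening a real file. The helper `summarize` is hypothetical.

```python
import json

# Inline copy of the sample entry, wrapped in an array (assumed file layout).
sample_text = """
[
  {
    "comment_id": "3a045101_dd7d55b3",
    "message": "GROUP_INTERFACE string given is wrong.Please modify.",
    "file_name": "android-android_api-base-src-main-java-org-iotivity-base-OcPlatform.java",
    "line_number": 47,
    "base_patch_number": 3,
    "changed_patch_number": 5,
    "line_change": 1
  }
]
"""


def summarize(record):
    """Format one review comment as 'file:line (patch A -> B): message'."""
    return "{}:{} (patch {} -> {}): {}".format(
        record["file_name"],
        record["line_number"],
        record["base_patch_number"],
        record["changed_patch_number"],
        record["message"],
    )


records = json.loads(sample_text)
for record in records:
    print(summarize(record))
```

For a real file, replace `json.loads(sample_text)` with `json.load` on an opened file handle, after checking that the top-level value is indeed a list.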
To create a JSON file from the database for another programming language, for example C, execute the following query and export the result as JSON:
use <DB_NAME> ;
select c.comment_id, cu.message, c.base_code as file_name, c.line_number,
c.base_patch_number, c.changed_patch_number, cu.line_change, cu.written_on
from code c
inner join comment_usefulness cu
on c.comment_id = cu.comment_id
where c.changed_patch_number != -1
and c.base_code like "%.c"
and cu.in_reply_to is null;
The root directory contains a file named 'unified_with_date.json', which contains the combined query results of the fifteen projects. The entire dataset is also available on Mega.
Input: linediff file1 file2 line_number change_window_size
Output: change_type <|sep|> input_code <|sep|> output_code
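The linediff output line can be split on the `<|sep|>` marker to recover its three fields. The sketch below assumes the output is a single string with exactly two separators; the function name `parse_linediff` and the example string (including its change type value) are made up for illustration.

```python
SEP = "<|sep|>"


def parse_linediff(output):
    """Split 'change_type <|sep|> input_code <|sep|> output_code' into a dict."""
    parts = [part.strip() for part in output.split(SEP)]
    if len(parts) != 3:
        raise ValueError("expected 3 fields, got {}".format(len(parts)))
    change_type, input_code, output_code = parts
    return {
        "change_type": change_type,
        "input_code": input_code,
        "output_code": output_code,
    }


# Hypothetical output string; real change_type values may differ.
example = "some_change_type <|sep|> int x = 0 ; <|sep|> int x = 1 ;"
print(parse_linediff(example))
```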
Input: alldiff file1 file2 max_target_len max_input_len
Output:
If no one-line change is found:
<|nochange|>
If a change is found:
input_code <|sep|> output_code <|datasep|> -- repeat
- Separate with <|datasep|> to get the changes
- Split one datapoint with <|sep|>
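The two splitting steps above can be sketched as follows. This assumes the alldiff output arrives as one string and that each datapoint contains exactly one `<|sep|>`; the function name `parse_alldiff` and the example string are hypothetical.

```python
SEP = "<|sep|>"
DATASEP = "<|datasep|>"
NOCHANGE = "<|nochange|>"


def parse_alldiff(output):
    """Return a list of (input_code, output_code) pairs, empty if no change."""
    text = output.strip()
    if text == NOCHANGE:
        return []
    pairs = []
    for chunk in text.split(DATASEP):
        chunk = chunk.strip()
        if not chunk:
            continue  # a trailing <|datasep|> leaves an empty chunk
        fields = [field.strip() for field in chunk.split(SEP)]
        if len(fields) != 2:
            raise ValueError("expected 2 fields per datapoint, got {}".format(len(fields)))
        pairs.append((fields[0], fields[1]))
    return pairs


# Hypothetical output with two datapoints.
example = "a = 1 ; <|sep|> a = 2 ; <|datasep|> b ( ) ; <|sep|> c ( ) ; <|datasep|>"
print(parse_alldiff(example))
```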
Input: oneliner file1 file2
Output:
If no one-line change is found:
<|nochange|>
Otherwise:
original_pos1 revised_pos1
original_pos2 revised_pos2
original_pos3 revised_pos3
. . .
(split with newline, then split with space)
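The "split with newline, then split with space" step can be sketched as below. It assumes each position is an integer line number, which the description above does not state explicitly; the function name `parse_oneliner` and the example string are hypothetical.

```python
NOCHANGE = "<|nochange|>"


def parse_oneliner(output):
    """Return a list of (original_pos, revised_pos) pairs, empty if no change."""
    text = output.strip()
    if text == NOCHANGE:
        return []
    positions = []
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip blank lines
        original_pos, revised_pos = line.split()
        # Assumption: positions are integers.
        positions.append((int(original_pos), int(revised_pos)))
    return positions


# Hypothetical output with two position pairs.
example = "3 3\n10 12\n"
print(parse_oneliner(example))
```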