This project enables language models to adapt to code completion tasks in different domains without fine-tuning. It builds on the kNN-LM framework [3]. Unlike the standard kNN-LM, we store only the samples that the language model predicts incorrectly. This decouples the database from the language model's capabilities, allowing us to use Bayesian inference to merge the predictions of the database and the language model.
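The idea above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names, the toy 2-d context vectors, and the fixed mixing weight `lam` are all assumptions. The fixed-weight interpolation shown is the one used by the standard kNN-LM [3]; the Bayesian merge used by this project is not reproduced here.

```python
import math


def build_error_datastore(samples):
    """Keep (context_vector, target) only where the LM's top-1 guess is wrong.

    `samples` yields (context_vector, lm_probs, target) triples, where
    lm_probs maps token -> probability. All names here are hypothetical.
    """
    store = []
    for vec, lm_probs, target in samples:
        predicted = max(lm_probs, key=lm_probs.get)
        if predicted != target:
            store.append((vec, target))
    return store


def knn_distribution(store, query, k=8):
    """Softmax over negative squared L2 distances of the k nearest entries."""
    if not store:
        return {}
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), tok)
        for vec, tok in store
    )[:k]
    d_min = scored[0][0]  # subtract the smallest distance for numerical stability
    weights = [(math.exp(-(d - d_min)), tok) for d, tok in scored]
    total = sum(w for w, _ in weights)
    probs = {}
    for w, tok in weights:
        probs[tok] = probs.get(tok, 0.0) + w / total
    return probs


def interpolate(lm_probs, knn_probs, lam=0.5):
    """Fixed-weight mix as in standard kNN-LM (assumed stand-in for the Bayesian merge)."""
    if not knn_probs:
        return dict(lm_probs)
    tokens = set(lm_probs) | set(knn_probs)
    return {
        t: lam * knn_probs.get(t, 0.0) + (1 - lam) * lm_probs.get(t, 0.0)
        for t in tokens
    }


# Toy data: the first sample is predicted correctly and is therefore not stored.
samples = [
    ((0.0, 0.0), {"get": 0.7, "set": 0.3}, "get"),
    ((1.0, 1.0), {"get": 0.6, "put": 0.4}, "put"),
]
store = build_error_datastore(samples)
knn = knn_distribution(store, query=(1.0, 0.9))
mixed = interpolate({"get": 0.6, "put": 0.4}, knn, lam=0.5)
```

Because correctly predicted samples are skipped, the datastore only grows where the model needs help, which is what keeps it independent of how strong the underlying model is.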
faiss-gpu==1.7.2
transformers==4.27.2
fuzzywuzzy==0.18.0
torch==1.13.0+cu117
Since developers typically request completions inside a project, we use a project snapshot at a specific commit as the retrieval database and the methods added after that commit as the test set. To partition the database, we use the Miner tool, which can be downloaded from here. For details about this tool, please refer to the original paper [1].
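The split rule itself is simple and can be sketched as below. This only illustrates the rule, assuming each method is identified by a signature string; the actual partitioning is done by Miner, and the function name and the dictionary layout are hypothetical.

```python
def split_by_snapshot(snapshot_methods, later_methods):
    """Split methods into a retrieval database and a test set.

    Both arguments map a method signature to its body (hypothetical layout).
    Methods present at the snapshot commit form the database; methods that
    appear only afterwards form the test set.
    """
    known = set(snapshot_methods)
    database = dict(snapshot_methods)
    test = {sig: body for sig, body in later_methods.items() if sig not in known}
    return database, test


snapshot = {"Mail.send()": "void send() { ... }"}
later = {
    "Mail.send()": "void send() { ... }",
    "Mail.retry()": "void retry() { ... }",
}
database, test_set = split_by_snapshot(snapshot, later)
```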
As an example of building the Froyo_Email database, follow these steps:
- Step 1: Use Miner to partition the dataset.
git clone git@github.com:Dustinmj/Froyo_Email.git
# create a new file 'projects.txt' containing the absolute path to the 'Froyo_Email' project, then run
./run_miner.sh projects.txt outputs/ stats.json
- Step 2: Partition the database, tokenize the code, and label the token categories using process.py.
# change dir_path to the location of 'Froyo_Email' on your computer, then run
python process.py
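If you prefer to generate 'projects.txt' programmatically, a small helper like the one below could do it. It is a sketch under the assumption that the file holds one absolute project path per line (the steps above only mention a single path); the helper name is hypothetical.

```python
from pathlib import Path


def write_projects_file(project_dirs, out_file="projects.txt"):
    """Write the absolute path of each project directory, one per line."""
    lines = [str(Path(p).resolve()) for p in project_dirs]
    Path(out_file).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    # Run from the directory that contains the cloned 'Froyo_Email' folder.
    write_projects_file(["Froyo_Email"])
```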
The data used in the paper can be found here. Here's an explanation of the provided files:
- train.txt: This file is used to build the database. Each line contains a complete method body that has already been tokenized.
- test.txt: This file is used to test token-level completion performance. Similar to the train.txt file, each line contains a complete method body that has already been tokenized.
- *_type.txt: These files contain information about token types corresponding to the tokens.
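The files described above could be read together as in the following sketch. It assumes that tokens and types are whitespace-separated, that the type file is line-aligned with its token file, and that there is exactly one type per token; none of these details are confirmed here, so check them against the actual data. The function name and the example file name `train_type.txt` are hypothetical.

```python
from pathlib import Path


def load_tokens_with_types(token_file, type_file):
    """Pair every token with its type, one list of pairs per method."""
    token_lines = Path(token_file).read_text(encoding="utf-8").splitlines()
    type_lines = Path(type_file).read_text(encoding="utf-8").splitlines()
    if len(token_lines) != len(type_lines):
        raise ValueError("token file and type file have different line counts")

    methods = []
    for line_no, (tok_line, typ_line) in enumerate(zip(token_lines, type_lines), 1):
        tokens, types = tok_line.split(), typ_line.split()
        if len(tokens) != len(types):
            raise ValueError(f"line {line_no}: {len(tokens)} tokens vs {len(types)} types")
        methods.append(list(zip(tokens, types)))
    return methods


if __name__ == "__main__":
    pairs = load_tokens_with_types("train.txt", "train_type.txt")
    print(f"loaded {len(pairs)} methods")
```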
We utilized the APIBench[2] dataset to validate code completion performance in the same application scenario. For token-level code completion, we used the training set from APIBench to construct the database and the test set to evaluate completion performance. You can find the token-level completion data here.
For line-level completion in the Android scenario, we used JavaParser to identify code lines containing Android API calls, which lets us extract the complete signatures of the Android APIs. The processed code has been uploaded here, and the processed data can be found here. Each record has the following JSON format:
{
"input": "code context to be completed",
"gt": "the ground truth of next code line",
"api_signature": "the signature of the Android API used in the line"
}
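A record in this format can be parsed and checked as sketched below. The three field names come from the format above; the helper names and the assumption that the data file stores one JSON object per line are illustrative only, so adjust the loader if the file is a single JSON array.

```python
import json

REQUIRED_KEYS = ("input", "gt", "api_signature")


def parse_record(text):
    """Parse one JSON record and verify that the expected fields are present."""
    record = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return record


def load_records(path):
    """Load records from a file, assuming one JSON object per line."""
    with open(path, encoding="utf-8") as fh:
        return [parse_record(line) for line in fh if line.strip()]


example = parse_record(
    '{"input": "code context to be completed", '
    '"gt": "the ground truth of next code line", '
    '"api_signature": "the signature of the Android API used in the line"}'
)
```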
Tip: Since the dataset for a single project is relatively small, we do not use faiss to train a vector database.
# These params are used for all models. So for any model you want to use, run this script first.
model_type=gpt2
pretrained_dir=microsoft/CodeGPT-small-java-adaptedGPT2
# or use unixcoder
# model_type=unixCoder
# pretrained_dir=microsoft/unixcoder-base
data_dir=./dataset/intra_project_completion
project_dir=${data_dir}/dataset/large_projects/AmazeFileManager
lit_file=${data_dir}/literals.json
output_dir=./save/intra_project/AmazeFileManager/${model_type}
CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
--data_dir=${project_dir} \
--lit_file=${lit_file} \
--langs=java \
--output_dir=${output_dir}/base \
--pretrain_dir=${pretrained_dir} \
--log_file=log.log \
--model_type=${model_type} \
--block_size=1024 \
--per_gpu_eval_batch_size=8 \
--logging_steps=100 \
--seed=42 \
--do_eval_token
# expected output
# word acc: [54.82]
dstore_dir=${output_dir}/knn_lm/db # path to store database
dstore_size=100000000 # the max number of tokens in the db
k=8 # set for k-nearest neighbors search
CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
--data_dir=${project_dir} \
--lit_file=${lit_file} \
--langs=java \
--output_dir=${output_dir}/knn_lm \
--pretrain_dir=${pretrained_dir} \
--log_file=log.log \
--model_type=${model_type} \
--block_size=1024 \
--per_gpu_eval_batch_size=8 \
--logging_steps=100 \
--seed=42 \
--do_eval_token \
--with_knn \
--build_index \
--dstore_dir=${dstore_dir} \
--dstore_size=${dstore_size} \
--k=${k}
# expected output
# word acc: [58.65]
dstore_dir=${output_dir}/knm_lm/db # path to store database
dstore_size=100000000 # the max number of tokens in the db
k=8 # set for k-nearest neighbors search
CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
--data_dir=${project_dir} \
--lit_file=${lit_file} \
--langs=java \
--output_dir=${output_dir}/knm_lm \
--pretrain_dir=${pretrained_dir} \
--log_file=log.log \
--model_type=${model_type} \
--block_size=1024 \
--per_gpu_eval_batch_size=8 \
--logging_steps=100 \
--seed=42 \
--do_eval_token \
--with_knn \
--build_index \
--dstore_dir=${dstore_dir} \
--dstore_size=${dstore_size} \
--k=${k} \
--only_errors \
--use_bayes
# expected output
# word acc: [68.38]
CUDA_VISIBLE_DEVICES=0 python ./reacc/run_token_completion.py \
--data_dir=${project_dir} \
--lit_file=${lit_file} \
--langs=java \
--output_dir=${output_dir}/bm25 \
--pretrain_dir=${pretrained_dir} \
--model_type=${model_type} \
--dstore_file=${project_dir}/train.txt \
--data_process \
--build_index \
--do_search \
--do_generate \
--use_bm25 \
--bm_name amazefilemanager
# expected result
# word acc: 0.56879
Tip: We use faiss to train a vector database for faster searching.
# These params are used for all models. So for any model you want to use, run this script first.
model_type=gpt2
pretrained_dir=microsoft/CodeGPT-small-py-adaptedGPT2
# or use unixcoder
# model_type=unixCoder
# pretrained_dir=microsoft/unixcoder-base
data_dir=./dataset/intra_scenario_completion/python
scenario_dir=${data_dir}/token_completion/DL
lit_file=${data_dir}/literals.json
output_dir=./save/intra_scenario/python/token/DL/${model_type}
CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
--data_dir=${scenario_dir} \
--lit_file=${lit_file} \
--langs=python \
--output_dir=${output_dir}/base \
--pretrain_dir=${pretrained_dir} \
--log_file=log.log \
--model_type=${model_type} \
--block_size=1024 \
--per_gpu_eval_batch_size=8 \
--logging_steps=100 \
--seed=42 \
--do_eval_token
dstore_dir=${output_dir}/knn_lm/db # path to store database
CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
--data_dir=${scenario_dir} \
--lit_file=${lit_file} \
--langs=python \
--output_dir=${output_dir}/knnlm \
--pretrain_dir=${pretrained_dir} \
--log_file=log.log \
--model_type=${model_type} \
--block_size=1024 \
--per_gpu_eval_batch_size=8 \
--logging_steps=100 \
--seed=42 \
--do_eval_token \
--with_knn \
--build_index \
--need_knn_train \
--dstore_dir=${dstore_dir}
dstore_dir=${output_dir}/knm_lm/db # path to store database
CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
--data_dir=${scenario_dir} \
--lit_file=${lit_file} \
--langs=python \
--output_dir=${output_dir}/knmlm \
--pretrain_dir=${pretrained_dir} \
--log_file=log.log \
--model_type=${model_type} \
--block_size=1024 \
--per_gpu_eval_batch_size=8 \
--logging_steps=100 \
--seed=42 \
--do_eval_token \
--with_knn \
--build_index \
--need_knn_train \
--only_errors \
--use_bayes \
--dstore_dir=${dstore_dir}
# step 1. bm25 search
CUDA_VISIBLE_DEVICES=0 python ./reacc/run_token_completion.py \
--data_dir=${scenario_dir} \
--lit_file=${lit_file} \
--output_dir=${output_dir}/reacc \
--pretrain_dir=${pretrained_dir} \
--model_type=${model_type} \
--dstore_file=${scenario_dir}/train.txt \
--data_process \
--do_search \
--do_generate \
--use_bm25 \
--bm_name py_dl
# step 2. dense search
CUDA_VISIBLE_DEVICES=0 python ./reacc/run_token_completion.py \
--data_dir=${scenario_dir} \
--lit_file=${lit_file} \
--output_dir=${output_dir}/reacc \
--pretrain_dir=${pretrained_dir} \
--model_type=${model_type} \
--dstore_file=${scenario_dir}/train.txt \
--do_search \
--do_generate \
--use_dense \
--build_index
# step 3. hybrid
CUDA_VISIBLE_DEVICES=0 python ./reacc/run_token_completion.py \
--data_dir=${scenario_dir} \
--lit_file=${lit_file} \
--output_dir=${output_dir}/reacc \
--pretrain_dir=${pretrained_dir} \
--model_type=${model_type} \
--dstore_file=${scenario_dir}/train.txt \
--do_search \
--do_generate \
--use_hybrid
model_type=gpt2
pretrained_dir=microsoft/CodeGPT-small-java-adaptedGPT2
data_dir=./dataset/intra_scenario_completion/java
lit_file=${data_dir}/literals.json
output_dir=./save/intra_scenario/java/line/Android/${model_type}
dstore_dir=${output_dir}/knm_lm/db
# step 1. build the database (same as token completion for Android)
scenario_dir=${data_dir}/token_completion/Android
CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
--data_dir=${scenario_dir} \
--lit_file=${lit_file} \
--langs=java \
--output_dir=${output_dir}/knm_lm \
--pretrain_dir=${pretrained_dir} \
--log_file=log.log \
--model_type=${model_type} \
--block_size=1024 \
--per_gpu_eval_batch_size=4 \
--logging_steps=100 \
--seed=42 \
--build_index \
--with_knn \
--need_knn_train \
--dstore_dir=${dstore_dir} \
--only_errors \
--use_bayes
# step 2. infer the next line
scenario_dir=${data_dir}/line_completion/Android
CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
--data_dir=${scenario_dir} \
--lit_file=${lit_file} \
--langs=java \
--output_dir=${output_dir}/knm_lm \
--pretrain_dir=${pretrained_dir} \
--log_file=log.log \
--model_type=${model_type} \
--block_size=1024 \
--per_gpu_eval_batch_size=4 \
--logging_steps=100 \
--seed=42 \
--do_eval_line \
--with_knn \
--dstore_dir=${dstore_dir} \
--only_errors \
--use_bayes
I found and fixed a bug in line-level code completion, so the line-level results may differ slightly from those reported in the paper. The bug was on line 684 of ./code/run_lm.py and line 68 of ./reacc/run_line_com.py:
Before: model_outputs = model(inputs)
After: model_outputs = model(inputs[:, :-1])
[1] Egor Bogomolov, Sergey Zhuravlev, Egor Spirin, Timofey Bryksin: Assessing Project-Level Fine-Tuning of ML4SE Models. CoRR abs/2206.03333 (2022)
[2] Yun Peng, Shuqing Li, Wenwei Gu, Yichen Li, Wenxuan Wang, Cuiyun Gao, Michael R. Lyu: Revisiting, Benchmarking and Exploring API Recommendation: How Far Are We? IEEE Trans. Software Eng. 49(4): 1876-1897 (2023)
[3] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis: Generalization through Memorization: Nearest Neighbor Language Models. ICLR 2020
This repository is inspired by the code from this repository: https://github.com/neulab/knn-transformers. We greatly appreciate the authors for providing their code.