
ase2023_knm-lm

Introduction

This project enables language models to adapt to code completion tasks in different domains without fine-tuning. The primary method is the kNN-LM framework. Unlike the standard kNN-LM, we store only the samples that the language model mispredicts. This decouples the datastore from the language model's own capabilities and allows us to use Bayesian inference to merge the predictions of the datastore and the language model.
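As a rough illustration of the idea (not the repository's actual implementation), the two pieces can be sketched in plain Python: a datastore built only from contexts where the LM's top-1 guess is wrong, and a merge step that combines the LM distribution with the retrieval distribution. The fixed-weight interpolation below is a stand-in for the paper's Bayesian merge; all names are hypothetical.

```python
from collections import Counter, defaultdict

def build_error_datastore(contexts, targets, lm_top1):
    """Store (context -> target) counts only where the LM's top-1 guess is wrong."""
    datastore = defaultdict(Counter)
    for ctx, tgt, guess in zip(contexts, targets, lm_top1):
        if guess != tgt:  # decouple the datastore from what the LM already predicts correctly
            datastore[ctx][tgt] += 1
    return datastore

def merge(p_lm, datastore_counts, lam=0.5):
    """Interpolate the LM distribution with normalized retrieval counts.

    The paper merges via Bayesian inference; fixed-weight interpolation is
    shown here only as a simplified stand-in.
    """
    total = sum(datastore_counts.values())
    merged = {}
    for tok in set(p_lm) | set(datastore_counts):
        p_knn = datastore_counts.get(tok, 0) / total if total else 0.0
        merged[tok] = lam * p_lm.get(tok, 0.0) + (1 - lam) * p_knn
    return merged
```

Because only LM errors enter the datastore, retrieval mass concentrates exactly on the tokens the base model gets wrong.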


Quickstart

1. Install requirements

faiss-gpu==1.7.2
transformers==4.27.2
fuzzywuzzy==0.18.0
torch==1.13.0+cu117

2. Dataset and Processing

2.1 Build Intra-Project Dataset

Since developer completions typically occur within projects, we use a project snapshot at a specific commit as the retrieval database and use methods added after that commit as the test set. To partition the database, we use the Miner tool, which can be downloaded from here. For detailed information about this tool, please refer to the original paper [1].

As an example, to build the Froyo_Email database, follow these steps:

  • Step 1: Use Miner to partition the dataset.
git clone [email protected]:Dustinmj/Froyo_Email.git

# create a new file 'projects.txt' containing the absolute path to the 'Froyo_Email' project, then run

./run_miner.sh projects.txt outputs/ stats.json
  • Step 2: Use process.py to partition the database, tokenize the code, and label token categories.
# change dir_path to the location of 'Froyo_Email' on your machine, then run
python process.py 

The data used in the paper can be found here. Here's an explanation of the provided files:

  • train.txt: This file is used to build the database. Each line contains a complete method body that has already been tokenized.
  • test.txt: This file is used to test token-level completion performance. Similar to the train.txt file, each line contains a complete method body that has already been tokenized.
  • *_type.txt: These files contain information about token types corresponding to the tokens.
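Since the token files and their companion `*_type.txt` files are line-aligned, they can be read in parallel. The file names below come from the list above, but the pairing logic is a sketch, not the repository's loader:

```python
def load_tokenized(path, type_path):
    """Yield (tokens, types) per method body from line-aligned files."""
    with open(path, encoding="utf-8") as f, open(type_path, encoding="utf-8") as g:
        for code_line, type_line in zip(f, g):
            tokens = code_line.split()
            types = type_line.split()
            # each token on a line has exactly one corresponding type label
            assert len(tokens) == len(types), "token/type files must align"
            yield tokens, types
```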

2.2 Build Intra-Scenario Dataset

We utilized the APIBench[2] dataset to validate code completion performance in the same application scenario. For token-level code completion, we used the training set from APIBench to construct the database and the test set to evaluate completion performance. You can find the token-level completion data here.

For Android line-level completion, we employed JavaParser to identify code lines containing Android API calls, allowing us to extract complete signatures for Android APIs. The processing code has been uploaded here, and the processed data can be found here. The JSON format is as follows:

{
    "input": "code context to be completed",
    "gt": "the ground truth of next code line",
    "api_signature": "the signature of the Android API used in the line"
}
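A minimal way to load such an entry and score a predicted line against the ground truth is character-level edit similarity. The sketch below uses stdlib `difflib`; the repository's requirements list fuzzywuzzy, whose `ratio` computes a comparable score, so treat this only as an illustration:

```python
import json
from difflib import SequenceMatcher

def edit_similarity(pred, gt):
    """Character-level similarity in [0, 100], analogous to fuzzywuzzy's ratio."""
    return 100.0 * SequenceMatcher(None, pred, gt).ratio()

# one entry in the documented format
entry = json.loads("""{
    "input": "code context to be completed",
    "gt": "the ground truth of next code line",
    "api_signature": "the signature of the Android API used in the line"
}""")
score = edit_similarity("the ground truth of next code line", entry["gt"])
```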

3. Evaluate

3.1 Intra-Project token-level Completion (RQ1):

Tip: Since the per-project datasets are relatively small, we do not use faiss to train a vector index for them.
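When the datastore is this small, an exact nearest-neighbor scan over the stored key vectors is cheap enough. A pure-Python sketch of what such a search does (the dimensionality and distance metric here are illustrative assumptions):

```python
def knn_search(query, keys, k=8):
    """Exact k-nearest-neighbor search by squared Euclidean distance.

    keys: list of (vector, token) pairs; returns the k closest tokens.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # full scan: fine for small per-project datastores, no index training needed
    ranked = sorted(keys, key=lambda kv: sqdist(query, kv[0]))
    return [tok for _, tok in ranked[:k]]
```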

# These parameters are shared by all models; whichever model you use, run this snippet first.

model_type=gpt2
pretrained_dir=microsoft/CodeGPT-small-java-adaptedGPT2
# or use unixcoder 
# model_type=unixCoder
# pretrained_dir=microsoft/unixcoder-base

data_dir=./dataset/intra_project_completion
project_dir=${data_dir}/dataset/large_projects/AmazeFileManager
lit_file=${data_dir}/literals.json
output_dir=./save/intra_project/AmazeFileManager/${model_type}
- Base model
CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
	--data_dir=${project_dir} \
	--lit_file=${lit_file} \
    --langs=java \
    --output_dir=${output_dir}/base \
    --pretrain_dir=${pretrained_dir} \
    --log_file=log.log \
    --model_type=${model_type} \
    --block_size=1024 \
    --per_gpu_eval_batch_size=8 \
    --logging_steps=100 \
    --seed=42 \
    --do_eval_token
    
# expected output
# word acc: [54.82]
- kNN-LM [3]
dstore_dir=${output_dir}/knn_lm/db    # path to store database
dstore_size=100000000    # the max number of tokens in the db
k=8    # number of neighbors for the kNN search

CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
	--data_dir=${project_dir} \
	--lit_file=${lit_file} \
    --langs=java \
    --output_dir=${output_dir}/knn_lm \
    --pretrain_dir=${pretrained_dir} \
    --log_file=log.log \
    --model_type=${model_type} \
    --block_size=1024 \
    --per_gpu_eval_batch_size=8 \
    --logging_steps=100 \
    --seed=42 \
    --do_eval_token \
    --with_knn \
    --build_index \
    --dstore_dir=${dstore_dir} \
    --dstore_size=${dstore_size} \
    --k=${k}

# expected output
# word acc: [58.65]
- kNM-LM (Ours)
dstore_dir=${output_dir}/knm_lm/db    # path to store database
dstore_size=100000000    # the max number of tokens in the db
k=8    # number of neighbors for the kNN search

CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
	--data_dir=${project_dir} \
	--lit_file=${lit_file} \
    --langs=java \
    --output_dir=${output_dir}/knm_lm \
    --pretrain_dir=${pretrained_dir} \
    --log_file=log.log \
    --model_type=${model_type} \
    --block_size=1024 \
    --per_gpu_eval_batch_size=8 \
    --logging_steps=100 \
    --seed=42 \
    --do_eval_token \
    --with_knn \
    --build_index \
    --dstore_dir=${dstore_dir} \
    --dstore_size=${dstore_size} \
    --k=${k} \
    --only_errors \
    --use_bayes

# expected output
# word acc: [68.38]
- BM25
CUDA_VISIBLE_DEVICES=0 python ./reacc/run_token_completion.py \
 	--data_dir=${project_dir} \
    --lit_file=${lit_file} \
    --langs=java  \
    --output_dir=${output_dir}/bm25 \
    --pretrain_dir=${pretrained_dir} \
    --model_type=${model_type} \
    --dstore_file=${project_dir}/train.txt \
    --data_process  \
    --build_index \
    --do_search \
    --do_generate \
    --use_bm25 \
    --bm_name amazefilemanager
    
# expected output
# word acc: 0.56879

3.2 Intra-Scenario token-level Completion (RQ2):

Tip: We use faiss to train a vector index for faster search.
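Roughly, what "training" buys here: faiss's IVF-style indexes first cluster the keys, then search only the closest cluster(s) instead of the whole datastore. The following is an illustrative pure-Python sketch of that two-stage idea, not faiss itself (function names and the single-probe search are assumptions):

```python
def assign(vec, centroids):
    """Index of the nearest centroid by squared Euclidean distance."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: d(vec, centroids[i]))

def build_ivf(keys, centroids):
    """Partition keys into inverted lists, one per centroid (the 'training' step)."""
    lists = [[] for _ in centroids]
    for vec, tok in keys:
        lists[assign(vec, centroids)].append((vec, tok))
    return lists

def ivf_search(query, centroids, lists, k=1):
    """Search only the closest cluster instead of scanning the whole datastore."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    probe = lists[assign(query, centroids)]
    return [tok for _, tok in sorted(probe, key=lambda kv: d(query, kv[0]))[:k]]
```

This is why training pays off only for the larger intra-scenario datastores: the cost of clustering is amortized over many faster queries.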

# These parameters are shared by all models; whichever model you use, run this snippet first.

model_type=gpt2
pretrained_dir=microsoft/CodeGPT-small-py-adaptedGPT2
# or use unixcoder 
# model_type=unixCoder
# pretrained_dir=microsoft/unixcoder-base

data_dir=./dataset/intra_scenario_completion/python
scenario_dir=${data_dir}/token_completion/DL
lit_file=${data_dir}/literals.json
output_dir=./save/intra_scenario/python/token/DL/${model_type}
- Base model
CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
	--data_dir=${scenario_dir} \
	--lit_file=${lit_file} \
    --langs=python  \
    --output_dir=${output_dir}/base \
    --pretrain_dir=${pretrained_dir} \
    --log_file=log.log \
    --model_type=${model_type} \
    --block_size=1024 \
    --per_gpu_eval_batch_size=8 \
    --logging_steps=100 \
    --seed=42 \
    --do_eval_token
- kNN-LM
dstore_dir=${output_dir}/knn_lm/db    # path to store database

CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
	--data_dir=${scenario_dir} \
	--lit_file=${lit_file} \
    --langs=python  \
    --output_dir=${output_dir}/knnlm \
    --pretrain_dir=${pretrained_dir} \
    --log_file=log.log \
    --model_type=${model_type} \
    --block_size=1024 \
    --per_gpu_eval_batch_size=8 \
    --logging_steps=100 \
    --seed=42 \
    --do_eval_token \
    --with_knn \
    --build_index \
    --need_knn_train \
    --dstore_dir=${dstore_dir}
- kNM-LM (Ours)
dstore_dir=${output_dir}/knm_lm/db    # path to store database

CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
	--data_dir=${scenario_dir} \
	--lit_file=${lit_file} \
    --langs=python  \
    --output_dir=${output_dir}/knmlm \
    --pretrain_dir=${pretrained_dir} \
    --log_file=log.log \
    --model_type=${model_type} \
    --block_size=1024 \
    --per_gpu_eval_batch_size=8 \
    --logging_steps=100 \
    --seed=42 \
    --do_eval_token \
    --with_knn \
    --build_index \
    --need_knn_train \
    --only_errors \
    --use_bayes \
    --dstore_dir=${dstore_dir}
- ReACC
# step 1. bm25 search
CUDA_VISIBLE_DEVICES=0 python ./reacc/run_token_completion.py \
	--data_dir=${scenario_dir} \
	--lit_file=${lit_file} \
    --output_dir=${output_dir}/reacc \
    --pretrain_dir=${pretrained_dir} \
    --model_type=${model_type} \
    --dstore_file=${scenario_dir}/train.txt \
    --data_process \
    --do_search \
    --do_generate \
    --use_bm25 \
    --bm_name py_dl

# step 2. dense search
CUDA_VISIBLE_DEVICES=0 python ./reacc/run_token_completion.py \
	--data_dir=${scenario_dir} \
	--lit_file=${lit_file} \
    --output_dir=${output_dir}/reacc \
    --pretrain_dir=${pretrained_dir} \
    --model_type=${model_type} \
    --dstore_file=${scenario_dir}/train.txt \
    --do_search \
    --do_generate \
    --use_dense \
    --build_index
    
# step 3. hybrid
CUDA_VISIBLE_DEVICES=0 python ./reacc/run_token_completion.py \
	--data_dir=${scenario_dir} \
	--lit_file=${lit_file} \
    --output_dir=${output_dir}/reacc \
    --pretrain_dir=${pretrained_dir} \
    --model_type=${model_type} \
    --dstore_file=${scenario_dir}/train.txt \
    --do_search \
    --do_generate \
    --use_hybrid

3.3 Line-level Completion with Android APIs

- kNM-LM (Ours)
model_type=gpt2
pretrained_dir=microsoft/CodeGPT-small-java-adaptedGPT2

data_dir=./dataset/intra_scenario_completion/java
lit_file=${data_dir}/literals.json
output_dir=./save/intra_scenario/java/line/Android/${model_type}

dstore_dir=${output_dir}/knm_lm/db

# step 1. build the database (same as token-level completion for Android)
scenario_dir=${data_dir}/token_completion/Android

CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
    --data_dir=${scenario_dir} \
    --lit_file=${lit_file} \
    --langs=java \
    --output_dir=${output_dir}/knm_lm \
    --pretrain_dir=${pretrained_dir} \
    --log_file=log.log \
    --model_type=${model_type} \
    --block_size=1024 \
    --per_gpu_eval_batch_size=4 \
    --logging_steps=100 \
    --seed=42 \
    --build_index \
    --with_knn \
    --need_knn_train \
    --dstore_dir=${dstore_dir} \
    --only_errors \
    --use_bayes

# step 2. infer the next line
scenario_dir=${data_dir}/line_completion/Android

CUDA_VISIBLE_DEVICES=0 python ./code/run_lm.py \
    --data_dir=${scenario_dir} \
    --lit_file=${lit_file} \
    --langs=java \
    --output_dir=${output_dir}/knm_lm \
    --pretrain_dir=${pretrained_dir} \
    --log_file=log.log \
    --model_type=${model_type} \
    --block_size=1024 \
    --per_gpu_eval_batch_size=4 \
    --logging_steps=100 \
    --seed=42 \
    --do_eval_line \
    --with_knn \
    --dstore_dir=${dstore_dir} \
    --only_errors \
    --use_bayes

I found and fixed a bug in line-level code completion, so the line-level results may differ slightly from those in the paper. The buggy code is at line 684 of ./code/run_lm.py and line 68 of ./reacc/run_line_com.py:

Before: model_outputs = model(inputs)

After: model_outputs = model(inputs[:, :-1])
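The reason for the fix: with teacher forcing, the logits at position t predict token t+1, so the final target token must not be fed to the model as input. A toy sketch of the input/target alignment (illustrative only, not the repository's code):

```python
def shift_for_next_token(tokens):
    """Align inputs and targets for next-token prediction.

    The model consumes tokens[:-1] and is scored against tokens[1:];
    feeding the full sequence would leak the last target into the input.
    """
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets
```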

References

[1] Egor Bogomolov, Sergey Zhuravlev, Egor Spirin, Timofey Bryksin: Assessing Project-Level Fine-Tuning of ML4SE Models. CoRR abs/2206.03333 (2022)

[2] Yun Peng, Shuqing Li, Wenwei Gu, Yichen Li, Wenxuan Wang, Cuiyun Gao, Michael R. Lyu: Revisiting, Benchmarking and Exploring API Recommendation: How Far Are We? IEEE Trans. Software Eng. 49(4): 1876-1897 (2023)

[3] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis: Generalization through Memorization: Nearest Neighbor Language Models. ICLR 2020

Acknowledgements

This repository is inspired by the code from https://github.com/neulab/knn-transformers. We greatly appreciate the authors for providing their code.


