This folder contains code for the Bug Localization benchmark. Challenge: given an issue with a bug description, identify the files within the project that need to be modified to address the reported bug.

We provide scripts for data collection and processing, exploratory data analysis, and several baseline implementations for the task.
Dependencies are listed for the pip dependency manager, so please run the following command to install all required packages:

```shell
pip install -r requirements.txt
```
All data is stored on HuggingFace 🤗. It contains:
- Dataset with bug localization data (issue description, SHA of the repo in its initial state and in the state after the bug fix). You can access the data using the `datasets` library:

```python
from datasets import load_dataset

# Select a configuration from ["py", "java", "kt", "mixed"]
configuration = "py"
# Select a split from ["dev", "train", "test"]
split = "dev"
# Load data
dataset = load_dataset("JetBrains-Research/lca-bug-localization", configuration, split=split)
```
where the splits are:
  - `dev` -- all collected data
  - `test` -- manually selected data (labeling artifacts)
  - `train` -- all collected data that is not in `test`
and the configurations are:
  - `py` -- only `.py` files in the diff
  - `java` -- only `.java` files in the diff
  - `kt` -- only `.kt` files in the diff
  - `mixed` -- at least one `.py`, `.java`, or `.kt` file, possibly alongside file(s) with other extensions, in the diff
- Archived repos (from which we can extract repo contents at different stages and get the diffs that contain the bug fixes). They are stored as `.tar.gz` archives, so you need to run a script to download and unpack them:
  - Set `repos_path` in the config to the directory where you want to store the repos
  - Run `load_data_from_hf.py`, which downloads all repos from HF and unpacks them
- Set