refactoring-ai / data-collection

Collect refactorings with metrics from java source code.

License: MIT License

Machine Learning for Software refactoring

This repository contains the data-collection part of our work on using machine learning methods to recommend software refactorings. It collects refactoring and non-refactoring instances, together with a large variety of metrics, from Java source code. These instances are later used to train the ML algorithms.

Quickstart

Prepare a MariaDB instance and create a database with the name refactoring_ai. If you have Docker, a quick way is: docker run -p 127.0.0.1:3306:3306 --name some-mariadb -e MYSQL_ROOT_PASSWORD=root -d mariadb (these are also the default credentials).
Build the jar with dependencies: ./gradlew quarkusBuild
Define the projects to mine for refactorings in input.csv
Start mining: java -jar build/datacollection-0.1.0-runner.jar

Configuration

Configuration is done with environment variables. You can also create a .env file with the variables. See .env.example for each variable and its explanation.

Paper and appendix

The paper can be found here: https://arxiv.org/abs/2001.03338
The raw dataset can be found here: https://zenodo.org/record/3547639
The appendix with our full results can be found here: https://zenodo.org/record/3583980

The data collection tool

Dependencies

Java 11, or higher

Database Clean-up

The enormous variety of refactoring types, projects and programming styles in the mined repositories can lead to various issues with the data. Therefore, we explain two common problems and potential solutions here.

!MAKE A COPY OF YOUR DATA BEFORE ALTERING IT!

Remove unfinished projects

The data-collection tool can fail for various reasons, e.g. OutOfMemoryErrors or unhandled exceptions in the mining phase, and will then not finish all projects. If you want to rerun the data-collection or remove unfinished projects for other reasons, you can do this with the following commands:

/*
Remove refactoring and stable instances before removing the projects.
*/
DELETE
	RefactoringCommit
FROM
	RefactoringCommit
	INNER JOIN 
		CommitMetaData ON RefactoringCommit.commitmetadata_id = CommitMetaData.id
	INNER JOIN 
		ClassMetric ON RefactoringCommit.classMetrics_id = ClassMetric.id
	LEFT JOIN 
		ProcessMetrics ON RefactoringCommit.processMetrics_id = ProcessMetrics.id
	LEFT JOIN 
		FieldMetric ON RefactoringCommit.fieldMetrics_id = FieldMetric.id
	LEFT JOIN 
		MethodMetric ON RefactoringCommit.methodMetrics_id = MethodMetric.id
	LEFT JOIN 
		VariableMetric ON RefactoringCommit.variableMetrics_id = VariableMetric.id
WHERE
	RefactoringCommit.project_id IN 
	(SELECT id FROM project WHERE finishedDate IS NULL);

DELETE
	StableCommit
FROM
	StableCommit
	INNER JOIN 
		CommitMetaData ON StableCommit.commitmetadata_id = CommitMetaData.id
	INNER JOIN 
		ClassMetric ON StableCommit.classMetrics_id = ClassMetric.id
	LEFT JOIN 
		ProcessMetrics ON StableCommit.processMetrics_id = ProcessMetrics.id
	LEFT JOIN 
		FieldMetric ON StableCommit.fieldMetrics_id = FieldMetric.id
	LEFT JOIN 
		MethodMetric ON StableCommit.methodMetrics_id = MethodMetric.id
	LEFT JOIN 
		VariableMetric ON StableCommit.variableMetrics_id = VariableMetric.id
WHERE
	StableCommit.project_id IN 
	(SELECT id FROM project WHERE finishedDate IS NULL);

/*
Finally, remove the unfinished projects themselves.
*/
DELETE
	project
FROM 
	project 
WHERE 
	finishedDate IS NULL;

Invalid Refactorings

RefactoringMiner is a great tool and quite stable, but not perfect. On some occasions it detects enormous numbers of refactorings for a single class file. These seem to be incorrect and should be marked invalid.

/*
Mark all invalid refactorings in the RefactoringCommit table. This allows you to manually inspect potentially invalid refactorings and decide how to handle them.
In the last line, you specify the threshold: the number of refactorings on the same commit and the same class file from which on they are considered invalid.
*/

UPDATE 
	RefactoringCommit
SET 
	isValid = FALSE
WHERE
	commitMetaData_id IN
	(SELECT DISTINCT
		commitMetaData_id
	FROM
		RefactoringCommit
	GROUP BY
		commitMetaData_id, className
	HAVING
		COUNT(className) >= 50);
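
To make the grouping logic of the query easier to follow, here is a minimal in-memory sketch in Java. It is not part of the tool: the Row class and the suspiciousCommits method are hypothetical names, and only the two grouping columns are modeled. Like the query, it returns commit ids, so every refactoring of an affected commit would be flagged, not only those of the class that exceeded the threshold.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class InvalidRefactoringFilter {

    /** Hypothetical stand-in for a RefactoringCommit row; only the grouping columns are modeled. */
    static final class Row {
        final long commitMetaDataId;
        final String className;

        Row(long commitMetaDataId, String className) {
            this.commitMetaDataId = commitMetaDataId;
            this.className = className;
        }
    }

    // Mirrors "GROUP BY commitMetaData_id, className".
    private static String groupKey(Row row) {
        return row.commitMetaDataId + "|" + row.className;
    }

    /** Returns the ids of commits with at least `threshold` refactorings on a single class. */
    static Set<Long> suspiciousCommits(List<Row> rows, long threshold) {
        Map<String, Long> countPerGroup = rows.stream()
                .collect(Collectors.groupingBy(InvalidRefactoringFilter::groupKey, Collectors.counting()));
        return rows.stream()
                .filter(row -> countPerGroup.get(groupKey(row)) >= threshold)
                .map(row -> row.commitMetaDataId)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row(1L, "A"), new Row(1L, "A"), new Row(1L, "A"),
                new Row(1L, "B"), new Row(2L, "A"));
        // A threshold of 3 is used here only to keep the demo small; the query above uses 50.
        System.out.println(suspiciousCommits(rows, 3));
    }
}
```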

Authors

This project was initially envisioned by Maurício Aniche, Erick Galante Maziero, Rafael Durelli, and Vinicius Durelli.

License

This project is licensed under the MIT license.

data-collection's Issues

Arbitrary files are missing during project initialization

During data-collection I noticed that sometimes files are missing during project initialization:

worker_19 | /data-collection_worker_19 2020-05-22 13:48:21 INFO App:201 For project: https://github.com/phonegap-build/FacebookConnect the project size could not be determined.
worker_19 | java.lang.IllegalArgumentException: /tmp/1590155289563-0/repo/platforms/android/cordova/node_modules/.bin/shjs does not exist
worker_19 |     at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2413) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at refactoringml.App.initProject(App.java:199) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at refactoringml.App.run(App.java:126) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at refactoringml.RunQueue.processRepository(RunQueue.java:128) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at refactoringml.RunQueue.processResponse(RunQueue.java:82) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at refactoringml.RunQueue.run(RunQueue.java:73) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 |     at refactoringml.RunQueue.main(RunQueue.java:60) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_6 | /data-collection_worker_6 2020-05-22 13:48:21 INFO App:140 Start mining project https://github.com/mohak1712/Insta-Chat(clone at /tmp/1590155297569-0/repo)
worker_19 | /data-collection_worker_19 2020-05-22 13:48:21 INFO App:140 Start mining project https://github.com/phonegap-build/FacebookConnect(clone at /tmp/1590155289563-0/repo)
worker_15 | /data-collection_worker_15 2020-05-22 13:49:43 INFO App:201 For project: https://github.com/titanpad/titanpad the project size could not be determined.
worker_15 | java.lang.IllegalArgumentException: /tmp/1590155378230-0/repo/etherpad/src/etherpad/globals.js does not exist
worker_15 |     at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2413) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at refactoringml.App.initProject(App.java:199) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at refactoringml.App.run(App.java:126) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at refactoringml.RunQueue.processRepository(RunQueue.java:128) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at refactoringml.RunQueue.processResponse(RunQueue.java:82) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at refactoringml.RunQueue.run(RunQueue.java:73) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 |     at refactoringml.RunQueue.main(RunQueue.java:60) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | /data-collection_worker_15 2020-05-22 13:49:43 INFO App:140 Start mining project https://github.com/titanpad/titanpad(clone at /tmp/1590155378230-0/repo)
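
The traces show the size computation aborting as soon as one entry cannot be found. The log does not reveal why the files are missing, so the following is only a sketch of a more tolerant size computation, not a fix for the root cause. It uses java.nio.file instead of Commons IO; the class and method names (TolerantDirectorySize, sizeOf, demoSize) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicLong;

final class TolerantDirectorySize {

    /** Sums the sizes of all regular files below root, skipping entries that cannot be read. */
    static long sizeOf(Path root) {
        AtomicLong total = new AtomicLong();
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    // Symbolic links are not followed, so only real files are counted.
                    if (attrs.isRegularFile()) {
                        total.addAndGet(attrs.size());
                    }
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    // A vanished or unreadable entry is skipped instead of aborting the walk.
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total.get();
    }

    /** Builds a small temporary tree (10 + 5 bytes) and returns its computed size. */
    static long demoSize() {
        try {
            Path dir = Files.createTempDirectory("size-demo");
            Files.write(dir.resolve("a.txt"), new byte[10]);
            Path sub = Files.createDirectories(dir.resolve("sub"));
            Files.write(sub.resolve("b.txt"), new byte[5]);
            return sizeOf(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoSize());
    }
}
```

The trade-off is that the reported size silently excludes skipped entries, so it may be worth logging them if the size is used for later analysis.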

Line Deletions

Examples of incorrect tracking of line additions and deletions:

1.

Commit: krishagni/openspecimen@beef30d
Java File: WEB-INF/src/com/krishagni/catissueplus/core/common/events/AbstractListCriteria.java
Lines Added (Detected - Total): 108
Lines Deleted (Detected - Total): 2
Lines Added (Manual - Total): 27 (1 + 10 + 12 + 4)
Lines Deleted (Manual - Total): 2 (2)

2.

Commit: lsfusion/platform@da7c6ed
File history: https://github.com/lsfusion/platform/commits/da7c6edcc3a729240cabd86ea3868073f016aaf6/platform/server/src/main/java/platform/server/session/PropertyChange.java
Lines Added (Detected - Total): 121
Lines Deleted (Detected - Total): 3
Lines Added (Manual - Total):
Lines Deleted (Manual - Total): 118 (28 + 15 + 15 + 4 + 7 + 15 + 3 + 7 + 20 + 1 + 3)

Missing Feature: Commit Number

We are not logging the current commit count in the commit-metadata.
We could use this for a more in-depth empirical analysis of the data, e.g. by clustering commits based on their spatial proximity in the history.

CK not being able to link class/methods with RMiner data

Sometimes, our tool can't link the method that was refactored (coming from RMiner) and the metrics that CK extracts.

Below, we'll show some examples for debugging.

Duplicate Metrics in Database

We collect lots of duplicates for all metrics (Process, Class, Method, Variable and Field), because we insert them into the database again and again even though they did not change. I suggest inserting new metrics into the db only if they are unique for the current commit. This would reduce the database size and probably increase the speed of the data-collection, because we would not recompute these metrics.
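
One possible reading of this suggestion is to remember the last stored values per element and skip the insert when nothing changed. The sketch below illustrates that idea in memory only. MetricDeduplicator, isNew and the string key are hypothetical, and the real tool would have to decide what identifies an element (for example, file path plus class name).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class MetricDeduplicator {

    // Last metric values seen per key; the keying scheme is an assumption.
    private final Map<String, int[]> lastValuesByKey = new HashMap<>();

    /** Returns true if the values differ from those last seen for the key, i.e. a new row is needed. */
    boolean isNew(String key, int... metricValues) {
        int[] previous = lastValuesByKey.put(key, metricValues.clone());
        return previous == null || !Arrays.equals(previous, metricValues);
    }

    public static void main(String[] args) {
        MetricDeduplicator deduplicator = new MetricDeduplicator();
        System.out.println(deduplicator.isNew("A.java", 1, 2)); // first sighting
        System.out.println(deduplicator.isNew("A.java", 1, 2)); // unchanged values
        System.out.println(deduplicator.isNew("A.java", 1, 3)); // changed values
    }
}
```

Comparing the full value arrays rather than only a hash avoids dropping a row because of a hash collision, at the cost of keeping the values in memory.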

Statistics

Cross-platform support

The tool does not work on Windows. The probable cause is how we deal with path names (the awesome / vs \ difference).
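
If the separator is indeed the cause, one common approach is to normalize every path to forward slashes before comparing it with paths reported by Git. The following is a minimal sketch of that idea; PathNormalizer and toGitPath are hypothetical names and not part of the tool.

```java
final class PathNormalizer {

    /** Converts Windows-style separators to the forward slashes used in Git paths. */
    static String toGitPath(String path) {
        return path.replace('\\', '/');
    }

    public static void main(String[] args) {
        System.out.println(toGitPath("src\\main\\java\\refactoringml\\App.java"));
    }
}
```

Note that a backslash is a legal file name character on Unix-like systems, so such a blanket replacement is only safe where paths are known to come from a Windows file system.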

Database Clean-up

The collected data has various issues, so we should update the database clean-up section in the Readme to cover them.

Issues

Unfinished projects
inValid refactoring instances

Add codecov.io to the repo

Let's add codecov.io, which allows us to follow our coverage!

(That would have been really useful at the beginning, though. Maybe a bit less now given that we are about to deploy, but ... I think we'll code this forever hehe)

Bug fix detection

During the data validation I noticed that the detection of bug fixes based on the commit messages is sometimes incorrect. I will add some examples here:

Examples:

Potential Causes:

We check if a commit message contains one of the keywords. This can lead to false positives: e.g. see Example 3.
https://github.com/refactoring-ai/predicting-refactoring-ml/blob/fdf3753fb35b4d4e699d1f6fbb2329441d11988a/data-collection/src/main/java/refactoringml/ProcessMetricTracker.java#L22-L27
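
To illustrate how plain substring matching produces false positives, the sketch below compares it with whole-word matching. The keyword list ("bug", "fix", "error") is hypothetical and chosen only for the demo; the tool's actual keywords are in the ProcessMetricTracker lines linked above.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class BugFixKeywordMatcher {

    // Hypothetical keywords for illustration only.
    private static final List<String> KEYWORDS = List.of("bug", "fix", "error");

    // Matches a keyword only when it is surrounded by word boundaries.
    private static final Pattern WHOLE_WORD = Pattern.compile(
            "\\b(" + String.join("|", KEYWORDS) + ")\\b", Pattern.CASE_INSENSITIVE);

    /** Substring check: also matches keywords embedded in longer words. */
    static boolean containsSubstring(String message) {
        String lower = message.toLowerCase(Locale.ROOT);
        return KEYWORDS.stream().anyMatch(lower::contains);
    }

    /** Whole-word check: "debug" and "prefix" no longer count as matches. */
    static boolean containsWholeWord(String message) {
        return WHOLE_WORD.matcher(message).find();
    }

    public static void main(String[] args) {
        String message = "Add debug prefix to logger";
        System.out.println(containsSubstring(message)); // matches "bug" inside "debug"
        System.out.println(containsWholeWord(message)); // no whole-word keyword present
    }
}
```

Whole-word matching trades false positives for false negatives: inflected forms such as "fixes" or "bugs" would have to be listed explicitly.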

Log analysis missing in the Readme.

We log a lot of data during the data-collection and have an analysis tool for it, but we do not explain it in the Readme.

ToDo's:

Update and check loganalyzer script
Explain usage in the Readme

Missing Features

(WIP)
During the analysis of the collected data I came across some missing features.

Commit Number Since Last Refactoring : CommitMetaData stores the number of commits since the last refactoring for every refactoring. We already track this to determine StableCommits, but don't collect it in the data.

Redundant or Invalid metrics

A collection of potentially redundant metrics.

Class-Level

NumberOfSynchronizedFields

only 7 instances are not zero in the entire database

Method-Level

methodSubClassesQty

only 31864 instances are not zero in the entire database
I could not link this to specific refactorings nor projects

Implement relative process metrics

See the paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.7712&rep=rep1&type=pdf (as suggested in refactoring-ai/predicting-refactoring-ml#130).

Implementing it would be a nice way to measure whether it also works for our domain. However, we are trying to move away from process metrics, given that they are just too expensive to calculate. Is it worth the effort?

JGit DiffEntry misses file changes

RefactoringMiner detects refactorings in files that are not linked by DiffEntry, e.g.:

Error:
2020-03-13 18:49:17 ERROR RefactoringAnalyzer:94 Refactoring miner found a refactoring for a newly introduced class file on commit: 85c68373dabe32334933bdf6e67091534fc1504a for new class file: fluentlenium-core/src/main/java/org/fluentlenium/adapter/FluentTest.java

File on current commit:
https://github.com/FluentLenium/FluentLenium/blob/85c68373dabe32334933bdf6e67091534fc1504a/fluentlenium-core/src/main/java/org/fluentlenium/adapter/FluentTest.java

Same file on Commit Parent:
https://github.com/FluentLenium/FluentLenium/tree/fe28c1348e70c0f4c2dbb209f3a54695aa7ec9ff/fluentlenium-core/src/main/java/org/fluentlenium/core/test

Move in-memory db to a file-based hsql db

The process metrics database is, right now, a simple in-memory HashMap. That works, but for large projects, this map might become too big to fit in our small VMs.

One idea is to move it to a file-based HSQLDB database. After all, we often have more disk than memory available.

We should make sure to reset the file whenever the application starts (a forced reset at the beginning, as the previous execution of the application might not have ended gracefully).

Missing Data Constraints

As of today, we have only very few constraints for the data in our database. A data constraint is an "assertion" over the data, e.g. for later refactorings on the same file, the process metrics have to be greater than or equal to those of earlier refactorings.
We do simple sanity checks in the Integration tests, especially the toy-projects, but the stress tests (#146 95) and canary tests showed that we missed many (edge) cases.
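
As an illustration, the example constraint above (process metrics never decrease across later refactorings of the same file) could be checked as follows. This is only a sketch: ProcessMetricConstraint and firstViolation are hypothetical names, and the values are assumed to be one process metric of one file, already ordered by commit.

```java
final class ProcessMetricConstraint {

    /** Returns the index of the first value lower than its predecessor, or -1 if there is none. */
    static int firstViolation(long[] valuesInCommitOrder) {
        for (int i = 1; i < valuesInCommitOrder.length; i++) {
            if (valuesInCommitOrder[i] < valuesInCommitOrder[i - 1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstViolation(new long[] {1, 2, 2, 5})); // constraint holds
        System.out.println(firstViolation(new long[] {1, 3, 2}));    // drop at index 2
    }
}
```

Returning the position of the violation rather than a boolean makes it easier to trace the offending row back to its commit.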

Advantages:

confidence in the data

For more inspiration look here: https://fontysblogt.nl/testing-machine-learning-applications/
