data-collection's Introduction

Machine Learning for Software refactoring

This repository contains the data-collection part of a project on using machine learning methods to recommend software refactorings. It collects refactoring and non-refactoring instances from Java source code, together with a large variety of metrics, which are later used to train the ML algorithms.

Quickstart

  1. Prepare a MariaDB instance and create a database named refactoring_ai. If you have Docker, a quick way is: docker run -p 127.0.0.1:3306:3306 --name some-mariadb -e MYSQL_ROOT_PASSWORD=root -d mariadb (these are also the default credentials).
  2. Build the jar with dependencies: ./gradlew quarkusBuild
  3. Define the projects to mine for refactorings in input.csv
  4. Start mining: java -jar build/datacollection-0.1.0-runner.jar

Configuration

Configuration is done with environment variables. You can also create a .env file containing the variables. See .env.example for each variable and its explanation.

Paper and appendix

The data collection tool

Dependencies

  • Java 11, or higher

Database Clean-up

The enormous variety of refactoring types, projects and programming styles in the mined repositories can lead to various issues with the data. Therefore, we explain two common problems and potential solutions here.

!MAKE A COPY OF YOUR DATA BEFORE ALTERING IT!

Remove unfinished projects

The data-collection tool can fail for various reasons, e.g. OutOfMemoryErrors or unhandled exceptions in the mining phase, and will thus not finish all projects. If you want to rerun the data collection or remove unfinished projects for other reasons, you can do so with the following commands:

/*
Remove refactoring and stable instances before the projects.
*/
DELETE
	RefactoringCommit
FROM
	RefactoringCommit
	INNER JOIN 
		CommitMetaData ON RefactoringCommit.commitmetadata_id = CommitMetaData.id
	INNER JOIN 
		ClassMetric ON RefactoringCommit.classMetrics_id = ClassMetric.id
	LEFT JOIN 
		ProcessMetrics ON RefactoringCommit.processMetrics_id = ProcessMetrics.id
	LEFT JOIN 
		FieldMetric ON RefactoringCommit.fieldMetrics_id = FieldMetric.id
	LEFT JOIN 
		MethodMetric ON RefactoringCommit.methodMetrics_id = MethodMetric.id
	LEFT JOIN 
		VariableMetric ON RefactoringCommit.variableMetrics_id = VariableMetric.id
WHERE
	RefactoringCommit.project_id IN 
	(SELECT id FROM project WHERE finishedDate IS NULL);

DELETE
	StableCommit
FROM
	StableCommit
	INNER JOIN 
		CommitMetaData ON StableCommit.commitmetadata_id = CommitMetaData.id
	INNER JOIN 
		ClassMetric ON StableCommit.classMetrics_id = ClassMetric.id
	LEFT JOIN 
		ProcessMetrics ON StableCommit.processMetrics_id = ProcessMetrics.id
	LEFT JOIN 
		FieldMetric ON StableCommit.fieldMetrics_id = FieldMetric.id
	LEFT JOIN 
		MethodMetric ON StableCommit.methodMetrics_id = MethodMetric.id
	LEFT JOIN 
		VariableMetric ON StableCommit.variableMetrics_id = VariableMetric.id
WHERE
	StableCommit.project_id IN 
	(SELECT id FROM project WHERE finishedDate IS NULL);

/*
Finally, remove the unfinished projects themselves.
*/
DELETE
	project
FROM 
	project 
WHERE 
	finishedDate IS NULL;

Invalid Refactorings

RefactoringMiner is a great tool and quite stable, but not perfect. On some occasions it detects enormous numbers of refactorings for a single class file. These seem to be incorrect and should be marked invalid.

/*
Mark all invalid refactorings in the RefactoringCommit table. This allows you to manually inspect potentially invalid refactorings and decide how to handle them.
In the last line you specify the threshold: the number of refactorings on the same commit and the same class file at which they are considered invalid.
*/

UPDATE 
	RefactoringCommit
SET 
	isValid = FALSE
WHERE
	commitMetaData_id IN
	(SELECT Distinct
		commitMetaData_id
	FROM
		RefactoringCommit
	GROUP BY
		commitMetaData_id, className
	HAVING
		COUNT(className) >= 50);

Authors

This project was initially envisioned by Maurício Aniche, Erick Galante Maziero, Rafael Durelli, and Vinicius Durelli.

License

This project is licensed under the MIT license.

data-collection's Issues

Arbitrary files are missing during project initialization

During data-collection I noticed that sometimes files are missing during project initialization:

worker_19 | /data-collection_worker_19 2020-05-22 13:48:21 INFO App:201 For project: https://github.com/phonegap-build/FacebookConnect the project size could not be determined.
worker_19 | java.lang.IllegalArgumentException: /tmp/1590155289563-0/repo/platforms/android/cordova/node_modules/.bin/shjs does not exist
worker_19 | at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2413) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at refactoringml.App.initProject(App.java:199) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at refactoringml.App.run(App.java:126) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at refactoringml.RunQueue.processRepository(RunQueue.java:128) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at refactoringml.RunQueue.processResponse(RunQueue.java:82) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at refactoringml.RunQueue.run(RunQueue.java:73) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_19 | at refactoringml.RunQueue.main(RunQueue.java:60) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_6 | /data-collection_worker_6 2020-05-22 13:48:21 INFO App:140 Start mining project https://github.com/mohak1712/Insta-Chat(clone at /tmp/1590155297569-0/repo)
worker_19 | /data-collection_worker_19 2020-05-22 13:48:21 INFO App:140 Start mining project https://github.com/phonegap-build/FacebookConnect(clone at /tmp/1590155289563-0/repo)
worker_15 | /data-collection_worker_15 2020-05-22 13:49:43 INFO App:201 For project: https://github.com/titanpad/titanpad the project size could not be determined.
worker_15 | java.lang.IllegalArgumentException: /tmp/1590155378230-0/repo/etherpad/src/etherpad/globals.js does not exist
worker_15 | at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2413) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at org.apache.commons.io.FileUtils.sizeOf(FileUtils.java:2417) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2479) ~[data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at refactoringml.App.initProject(App.java:199) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at refactoringml.App.run(App.java:126) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at refactoringml.RunQueue.processRepository(RunQueue.java:128) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at refactoringml.RunQueue.processResponse(RunQueue.java:82) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at refactoringml.RunQueue.run(RunQueue.java:73) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | at refactoringml.RunQueue.main(RunQueue.java:60) [data-collection-0.0.1-SNAPSHOT-jar-with-dependencies.jar:?]
worker_15 | /data-collection_worker_15 2020-05-22 13:49:43 INFO App:140 Start mining project https://github.com/titanpad/titanpad(clone at /tmp/1590155378230-0/repo)
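
The traces above show FileUtils.sizeOfDirectory aborting the whole size computation as soon as one listed entry no longer exists. One possible workaround, sketched here with the JDK's Files.walkFileTree rather than Commons IO, is to skip such entries and keep summing. The class and method names are hypothetical and not part of the tool. That the missing entries are broken symbolic links is a plausible but unconfirmed assumption.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not taken from the data-collection code base.
class TolerantDirectorySize {

    /** Sums the sizes of all regular files under root, skipping entries that cannot be read. */
    static long sizeOf(Path root) throws IOException {
        AtomicLong total = new AtomicLong();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // Symbolic links are not followed by default, so a dangling
                // link is reported as a non-regular file and simply ignored.
                if (attrs.isRegularFile()) {
                    total.addAndGet(attrs.size());
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // An entry that vanished or is unreadable is skipped instead of aborting.
                return FileVisitResult.CONTINUE;
            }
        });
        return total.get();
    }
}
```

With this approach the reported size is a lower bound when entries are skipped, which may be acceptable if the value is only used for logging.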

Line Deletions

Examples of incorrect tracking of line additions and deletions:

1.

Commit: krishagni/openspecimen@beef30d
Java File: WEB-INF/src/com/krishagni/catissueplus/core/common/events/AbstractListCriteria.java
Lines Added (Detected - Total): 108
Lines Deleted (Detected - Total): 2
Lines Added (Manual - Total): 27 (1 + 10 + 12 + 4)
Lines Deleted (Manual - Total): 2 (2)

2.

Commit: lsfusion/platform@da7c6ed
Filehistory: https://github.com/lsfusion/platform/commits/da7c6edcc3a729240cabd86ea3868073f016aaf6/platform/server/src/main/java/platform/server/session/PropertyChange.java
Lines Added (Detected - Total): 121
Lines Deleted (Detected - Total): 3
Lines Added (Manual - Total):
Lines Deleted (Manual - Total): 118 (28 + 15 + 15 + 4 + 7 + 15 + 3 + 7 + 20 + 1 + 3)

Missing Feature: Commit Number

We are not logging the current commit count in the commit-metadata.
We could use this for a more in-depth empirical analysis of the data, e.g. by clustering commits based on their spatial proximity in the history.
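
A minimal sketch of what such a commit number could enable, assuming a linear, oldest-first list of commit ids (real Git histories are graphs, so this is a simplification). The class, the method names and the gap-based grouping are illustrative assumptions, not part of the tool.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the data-collection code base.
class CommitNumbering {

    /** Maps each commit id to its 1-based position in the given oldest-first history. */
    static Map<String, Integer> number(List<String> historyOldestFirst) {
        Map<String, Integer> numbers = new LinkedHashMap<>();
        int next = 1;
        for (String commitId : historyOldestFirst) {
            numbers.put(commitId, next++);
        }
        return numbers;
    }

    /** Groups ascending commit numbers; a new cluster starts when the gap exceeds maxGap. */
    static List<List<Integer>> cluster(List<Integer> sortedNumbers, int maxGap) {
        List<List<Integer>> clusters = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int n : sortedNumbers) {
            if (!current.isEmpty() && n - current.get(current.size() - 1) > maxGap) {
                clusters.add(current);
                current = new ArrayList<>();
            }
            current.add(n);
        }
        if (!current.isEmpty()) {
            clusters.add(current);
        }
        return clusters;
    }
}
```

For example, with maxGap = 2 the commit numbers 1, 2, 3, 10, 11, 30 fall into three clusters: [1, 2, 3], [10, 11] and [30].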

Duplicate Metrics in Database

We collect lots of duplicates for all metrics (Process, Class, Method, Variable and Field), because we insert them into the database again and again even though they did not change. I suggest inserting new metrics into the db only if they are unique for the current commit. This would reduce the database size and probably increase the speed of the data collection, because we would not recompute these metrics.
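
The suggestion could be sketched as a lookup keyed on the metric values, so that an identical set of values reuses an existing row instead of creating a new one. The in-memory map below only stands in for the database, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not taken from the data-collection code base.
class MetricDeduplicator {
    // Lists compare by value, so equal metric vectors map to the same row id.
    private final Map<List<Integer>, Long> knownRows = new HashMap<>();
    private long nextId = 1;

    /** Returns the id of an identical existing row, or registers the values as a new row. */
    long idFor(int... metricValues) {
        List<Integer> key = Arrays.stream(metricValues).boxed().collect(Collectors.toList());
        return knownRows.computeIfAbsent(key, k -> nextId++);
    }

    /** Number of distinct metric rows stored so far. */
    int storedRows() {
        return knownRows.size();
    }
}
```

In a real database the same effect might be achieved with a unique index over the metric columns or a stored hash of them. Which of these fits the existing schema is left open here.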

Statistics

Cross-platform support

The tool does not work on Windows. The probable cause is how we deal with path names (the awesome / vs \ difference).
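
If path handling is indeed the cause, one common approach is to build paths with java.nio.file and convert them to the /-separated form that Git uses before comparing or storing them. This is a sketch under that assumption. The class and method names are hypothetical, and the actual failing code paths have not been identified.

```java
import java.nio.file.Paths;

// Hypothetical helper, not taken from the data-collection code base.
class PathNormalizer {

    /**
     * Converts Windows separators to '/'.
     * Caveat: on Unix a backslash is a legal file name character, so this
     * should only be applied to paths produced by the platform's own APIs.
     */
    static String toUnixSeparators(String path) {
        return path.replace('\\', '/');
    }

    /** Joins path segments with the platform API and returns a '/'-separated string. */
    static String repoRelative(String first, String... more) {
        return toUnixSeparators(Paths.get(first, more).toString());
    }
}
```

Both calls yield the same string on Windows and Linux, which keeps file names comparable with the paths reported by Git.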

Database Clean-up

The collected data has various issues. Thus we should update the database clean-up section in the Readme, to incorporate them.

Issues

  1. Unfinished projects
  2. Invalid refactoring instances

Add codecov.io to the repo

Let's add codecov.io, which allows us to follow our coverage!

(That would have been really useful at the beginning, though. Maybe a bit less now given that we are about to deploy, but ... I think we'll code this forever hehe)

Bug fix detection

Log analysis missing in the Readme.

We log a lot of data during the data-collection and have an analysis tool for it, but we do not explain it in the Readme.

ToDo's:

  1. Update and check loganalyzer script
  2. Explain usage in the Readme

Missing Features

(WIP)
During the analysis of the collected data I came across some missing features.

  1. Commit Number Since Last Refactoring: CommitMetaData should store the number of commits since the last refactoring for every refactoring. We already track this to determine StableCommits, but don't collect it in the data.

Redundant or Invalid metrics

A collection of potentially redundant metrics.

Class-Level

NumberOfSynchronizedFields

  • only 7 instances are not zero in the entire database

Method-Level

methodSubClassesQty

  • only 31864 instances are not zero in the entire database
  • I could not link this to specific refactorings nor projects

JGit DiffEntry misses file changes

Related to issue refactoring-ai/predicting-refactoring-ml#165

RefactoringMiner detects refactorings in files that are not linked by a DiffEntry, e.g.:

Error:
2020-03-13 18:49:17 ERROR RefactoringAnalyzer:94 Refactoring miner found a refactoring for a newly introduced class file on commit: 85c68373dabe32334933bdf6e67091534fc1504a for new class file: fluentlenium-core/src/main/java/org/fluentlenium/adapter/FluentTest.java

File on current commit:
https://github.com/FluentLenium/FluentLenium/blob/85c68373dabe32334933bdf6e67091534fc1504a/fluentlenium-core/src/main/java/org/fluentlenium/adapter/FluentTest.java

Same file on Commit Parent:
https://github.com/FluentLenium/FluentLenium/tree/fe28c1348e70c0f4c2dbb209f3a54695aa7ec9ff/fluentlenium-core/src/main/java/org/fluentlenium/core/test

Move in-memory db to a file-based hsql db

The process metrics database is, right now, a simple in-memory HashMap. That works, but for large projects, this map might become too big to fit in our small VMs.

One idea is to move it to a file-based HSQLDB database. After all, we often have more disk than memory available.

We should make sure to reset the file whenever the application starts (a forced reset at the beginning, as the previous execution of the application might not have ended gracefully).

Missing Data Constraints

As of today, we have only very few constraints for the data in our database. A data constraint is an "assertion" over the data, e.g. the process metrics of a refactoring have to be higher or equal for later refactorings on the same file.
We do simple sanity checks in the Integration tests, especially the toy-projects, but the stress tests (#146 95) and canary tests showed that we missed many (edge) cases.
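
The example constraint above could be checked with a small routine like the following sketch. It assumes the observations arrive in chronological order and that the metric is cumulative (for instance a commit count per file). The Observation shape and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the data-collection code base.
class ProcessMetricConstraint {

    /** One value of a cumulative process metric recorded for a file. */
    static final class Observation {
        final String fileName;
        final long value;

        Observation(String fileName, long value) {
            this.fileName = fileName;
            this.value = value;
        }
    }

    /** Returns true if, for every file, the metric never decreases over time. */
    static boolean isMonotonic(List<Observation> chronological) {
        Map<String, Long> lastSeen = new HashMap<>();
        for (Observation o : chronological) {
            // put returns the previous value for this file, or null on first sight.
            Long previous = lastSeen.put(o.fileName, o.value);
            if (previous != null && o.value < previous) {
                return false;
            }
        }
        return true;
    }
}
```

A check of this kind could run after each mined project, so that a violated constraint is reported close to the data that caused it.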

Advantages:

  1. confidence in the data

For more inspiration look here: https://fontysblogt.nl/testing-machine-learning-applications/
