LIICD

This repository hosts two Python implementations of a language-agnostic Incremental Clone Detector capable of detecting Type-1 clones. Both tools were developed as part of a Master's thesis study in collaboration with the Software Improvement Group (SIG) in the Netherlands.

Implementations

  1. LIICD (under /original) implements Hummel's clone-index-based approach, skipping the normalization step.
  2. The LSH-based detector (under /LSH-based) uses Locality Sensitive Hashing (LSH) to identify similar files and calculates clones only for those files.

Requirements

  • Python: 3.7+
  • For both sub-projects, install dependencies via pip install -r requirements.txt

Usage

1. Generate Config file

Both implementations are run via the main script main.py. Before running it, you must generate a configuration file that lists the commits to be analyzed. The generate_config.py script does this: it takes a git-tracked repository and the number of commits as parameters and generates the file. The format of the file is as follows:

{
    "commits": [
        {
            "id": "cb8f645e0f",
            "changes": [
                {
                    "type": "M",
                    "filename": "lib/ansible/plugins/loader.py"
                }
            ]
        },
        {
            "id": "564907d8ac",
            "changes": [
                {
                    "type": "A",
                    "filename": "changelogs/fragments/distribution_test_refactor.yml"
                },
                {
                    "type": "R",
                    "filename": [
                        "test/units/module_utils/facts/system/distribution/__init__.py",
                        "test/units/module_utils/facts/system/__init__.py"
                    ]
                },
                {
                    "type": "D",
                    "filename": "test/units/module_utils/facts/system/distribution/fixtures/arch_linux_na.json"
                }
            ]
        }
    ]
}

The generated configuration file, stored under configurations/{project_name}, must then be passed as an argument to the main.py script.
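As an illustration of the format above, the following minimal sketch walks a configuration document and flattens it into one record per file change. It is not part of the repository: the function name iter_changes and the inline EXAMPLE are hypothetical. Rename entries carry two paths; since the order of those paths is not documented here, the sketch returns them as a tuple without interpreting them.

```python
import json


def iter_changes(config_text):
    """Yield (commit_id, change_type, paths) for every change in a config document.

    `paths` is always a tuple: one element for A/M/D entries, two for renames (R).
    """
    config = json.loads(config_text)
    for commit in config["commits"]:
        for change in commit["changes"]:
            filename = change["filename"]
            # Rename entries store a list of two paths; all others store a string.
            paths = tuple(filename) if isinstance(filename, list) else (filename,)
            yield commit["id"], change["type"], paths


# Hypothetical, shortened config in the same shape as the example above.
EXAMPLE = """
{
  "commits": [
    {"id": "cb8f645e0f",
     "changes": [{"type": "M", "filename": "lib/ansible/plugins/loader.py"}]},
    {"id": "564907d8ac",
     "changes": [
       {"type": "R", "filename": [
         "test/units/module_utils/facts/system/distribution/__init__.py",
         "test/units/module_utils/facts/system/__init__.py"]}
     ]}
  ]
}
"""

for commit_id, change_type, paths in iter_changes(EXAMPLE):
    print(commit_id, change_type, *paths)
```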

2. Run the Detector

The next step is to run the desired implementation with the required arguments: the path to the repository to be analyzed, the path to the config file, and the number of commits to analyze (these commits must be included in that config file).

python -m detector.main -p ~/projects/{my_project}/ -u ~/CloneDetector/configurations/{my_project}_updates.json -c 50

Note: Ensure that the codebase is checked out at HEAD.

main.py Arguments:

  • -p: The path to the software project to be analyzed
  • -u: The path to the configuration file that holds the commits to be analyzed
  • -c: The number of commits to be analyzed

generate_config.py Arguments:

  • -p: The path of the codebase for which we generate the config
  • -n: The number of commits to analyze (default 10)

Configuration Parameters

Both implementations include configuration parameters that allow for additional tuning. These, along with their default values, can be found in the config.py file of each subdirectory.

LIICD

  • CHUNK_SIZE: The number of lines in each consecutive block to be hashed. This also defines the minimum clone length. (default: 6)
  • COMMITS: The number of subsequent commits to analyze. The repository under analysis must have at least COMMITS + 1 commits since the intermediate data are constructed from HEAD-COMMITS-1. (default: 2)
  • SKIP_DIRS: A list of directories that are excluded from the analysis.
  • SKIP_FILES: A list of file extensions that are excluded from the analysis.
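To make the role of CHUNK_SIZE concrete, here is a minimal sketch of a chunk-hash index, assuming that every window of CHUNK_SIZE consecutive lines is hashed verbatim (no normalization). It is not the repository's code: the function names, the choice of SHA-256, and the in-memory dictionary are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

CHUNK_SIZE = 6  # same default as in the list above


def chunk_hashes(lines, chunk_size=CHUNK_SIZE):
    """Yield (start_line, digest) for each window of `chunk_size` consecutive lines."""
    for start in range(len(lines) - chunk_size + 1):
        window = "\n".join(lines[start:start + chunk_size])
        digest = hashlib.sha256(window.encode("utf-8")).hexdigest()
        yield start + 1, digest  # 1-based line numbers


def build_index(files, chunk_size=CHUNK_SIZE):
    """Map each chunk digest to the (path, start_line) locations where it occurs."""
    index = defaultdict(list)
    for path, lines in files.items():
        for start, digest in chunk_hashes(lines, chunk_size):
            index[digest].append((path, start))
    return index


def clone_groups(index):
    """Return the location lists of chunks that occur more than once."""
    return [locations for locations in index.values() if len(locations) > 1]
```

A file shorter than CHUNK_SIZE lines produces no chunks, which is why CHUNK_SIZE is also the shortest clone that can be reported.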

LSH-based

  • CHUNK_SIZE: Identical to LIICD.
  • COMMITS: Identical to LIICD.
  • SKIP_DIRS: Identical to LIICD.
  • SKIP_FILES: Identical to LIICD.
  • THRESHOLD: The similarity threshold above which two files are treated as similar and compared. (default: 0.2)
  • PERMUTATIONS: The number of hash functions used to generate the MinHash signature. Affects the error rate. (default: 68)
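The interaction between PERMUTATIONS and THRESHOLD can be sketched in plain Python. The sketch below is an assumption-laden illustration, not the repository's implementation: it treats a file as a non-empty set of lines, simulates each hash function by salting SHA-256 with a seed, and covers only the MinHash signature and similarity estimate (the LSH bucketing step is left out).

```python
import hashlib

PERMUTATIONS = 68  # defaults taken from the list above
THRESHOLD = 0.2


def _seeded_hash(seed, item):
    """Hash `item` under the hash function identified by `seed` (64-bit result)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(items, permutations=PERMUTATIONS):
    """Return the MinHash signature of a non-empty collection of items."""
    unique = set(items)
    return [min(_seeded_hash(seed, item) for item in unique)
            for seed in range(permutations)]


def estimated_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def are_similar(items_a, items_b, threshold=THRESHOLD):
    """True when the estimated similarity reaches the threshold."""
    sig_a = minhash_signature(items_a)
    sig_b = minhash_signature(items_b)
    return estimated_similarity(sig_a, sig_b) >= threshold
```

More hash functions make the estimate less noisy at the cost of more hashing work per file, which is the trade-off behind the PERMUTATIONS setting.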

liicd's Issues

File renames causing issues with detection

Currently, when parsing the commits of a repository, the generate_config.py script only saves file changes with the statuses Modification (M), Deletion (D), or Addition (A). However, when a file is moved or renamed, git uses a different status tag of the form R{number}.

The problem that might occur is the following:

  1. The detector checks out to HEAD~2 to handle the commit that corresponds to that project snapshot
  2. The detector reads the file changes for that commit from the configuration file but since renames are not tracked, there are no related changes, hence the index does not get updated with the new location or filename
  3. The detector checks out to HEAD~1 to handle the commit that corresponds to that project snapshot
  4. The commit includes changes that refer to the now renamed/moved file.
  5. The detector tries to find the file in the index but since the index is not updated, it fails.
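One way to capture renames is to treat any status that starts with R as a rename when parsing git's name-status output, where each line is tab-separated and a rename line carries the old and the new path. The sketch below is a hypothetical helper, not the script's current code; it keeps the two paths in the order git prints them (old, then new) and silently skips statuses it does not recognize.

```python
def parse_name_status(output):
    """Convert `git diff --name-status`-style text into config-like change entries."""
    changes = []
    for line in output.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        kind = status[0]  # "R100" -> "R"; the digits are git's similarity score
        if kind in ("A", "M", "D") and len(paths) == 1:
            changes.append({"type": kind, "filename": paths[0]})
        elif kind == "R" and len(paths) == 2:
            # Keep both paths so the index can be updated with the new location.
            changes.append({"type": "R", "filename": paths})
    return changes
```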

Replace print statements with file logging

Replace print statements with file logging so that the detector's output for each processed commit can be revisited after the application has terminated.
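A minimal sketch of what this could look like with the standard logging module follows. The function name build_logger, the logger name, and the message format are assumptions for illustration only.

```python
import logging


def build_logger(log_path, name="detector"):
    """Return a logger that appends timestamped records to `log_path`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Avoid attaching a second handler if the function is called twice.
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A call such as build_logger("detector.log").info("commit %s processed", commit_id) would then take the place of the corresponding print statement.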

List of Invalid/Binary files handling not exhaustive

When reading a codebase to create the initial Clone Index, the application skips binary and invalid (non-unicode) files. This happens on the basis of a list that includes the extensions that should be skipped. This list is not exhaustive though, in the sense that there might be binary file extensions that have not been included.

In such a case, the issue would be the following:

  1. The application would ignore the file due to the inability to read it (an exception is thrown but the application continues by ignoring the file)
  2. A commit includes changes that affect that file. For instance, a .jpg file is renamed (assuming for the sake of the example that .jpg is not on the skip list, although it actually is).
  3. The application tries to find the file in the index, but since it was not processed when the codebase was read, it fails.

Possible Solutions

  1. Before reading a file, check whether it is binary or invalid. There are libraries that detect binary files, but they do so probabilistically, so they are not fully reliable either.
  2. Update the application to only consider file extensions that form the majority of the codebase. For example for a Java-based project, only consider .java extensions or maybe a combination of extensions in case the project uses multiple languages.
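The second option could be sketched as follows. This is only an illustration under stated assumptions: the helper names and the min_share cut-off are hypothetical, and files without an extension are ignored when computing shares.

```python
from collections import Counter
from pathlib import PurePosixPath


def dominant_extensions(paths, min_share=0.2):
    """Return the extensions that account for at least `min_share` of the files."""
    counts = Counter(PurePosixPath(p).suffix.lower() for p in paths)
    counts.pop("", None)  # ignore files without an extension
    total = sum(counts.values())
    if total == 0:
        return set()
    return {ext for ext, n in counts.items() if n / total >= min_share}


def is_tracked(path, allowed):
    """True when the file's extension is in the allowed set."""
    return PurePosixPath(path).suffix.lower() in allowed
```

With an allowlist like this, a change to a file outside the dominant extensions would be skipped both when building the index and when processing commits, so the two steps stay consistent.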
