variantsync / diffdetective

Library for Variability-Aware Differencing and the Analysis of Edits to Preprocessor-Based Software Product Lines

Home Page: https://variantsync.github.io/DiffDetective/

License: GNU Lesser General Public License v3.0

Java 83.76% Python 10.83% Shell 0.93% Haskell 2.19% Dockerfile 0.24% Batchfile 0.19% TeX 0.29% Nix 0.91% ANTLR 0.66%

diffing software-product-lines variability-analysis software-evolution differencing repository-mining variability

diffdetective's Issues

Disabled `StarFold` tests

Currently, the StarFold tests are disabled because DiffTree.toTextDiff has been removed (see #70). As StarFold is currently unused we have the choice of either fixing the tests or removing StarFold.

Implement new atomic patterns

Refactor the atomic patterns to the new definition based on sat solving.

Mixed Terminology

Currently, DiffNodes have multiple methods to access their feature mapping. However, some of them actually compute presence conditions. Inspect what each method does and rename it if necessary to streamline our terminology.

Generation of ANTLR files requires ANTLR to be installed

We use ANTLR to generate a parser for formulas within C preprocessor macros. To regenerate the grammar, we have a script that runs ANTLR as follows:

DiffDetective/scripts/generateANTLRClasses.sh

Line 13 in aa468f0

antlr4 -o $OUTPUT_DIR -package "$PACKAGE" "$GRAMMAR_FILE"

However, this call requires antlr4 to be installed locally on the developer's machine. This might be tedious on Windows. Given that we already have a Maven dependency on antlr4, is there a way to replace this call to antlr4 with a call to Maven, which in turn runs its ANTLR dependency? Alternatively, we could also ship the antlr4 jar file within the DiffDetective repository and run this jar.

Test Propositional Formula Parser

Currently, we have a test for the boolean abstraction in src/test/java/CPPParserTest.java, which tests that C preprocessor directives are correctly abstracted to boolean formulas (as strings). However, there are no tests yet that check that these strings are then (in a second step) parsed correctly to the expected propositional formula AST. The CPPParserTest should be extended accordingly, or there should be a new test class.

This issue should be tackled once PR #103 is merged to avoid merge conflicts and redundant effort.

Bug in preliminary MoveElse

During refactoring for #59 I found the following possible bug in preliminary/pattern/semantic/MoveElse.java. I did not test if this is actually a bug, but I'm pretty sure.

The following lines (38-42 on 93f836e)

Collection<DiffNode> commonAddElse = annotationNode.getAllChildren();
commonAddElse.retainAll(annotationNode.getParent(AFTER).getAllChildren());

Collection<DiffNode> commonRemElse = removedElse.getAllChildren();
commonRemElse.retainAll(annotationNode.getParent(AFTER).getAllChildren());

modify the collection returned by DiffNode.getAllChildren by using Collection.retainAll. The problem is that DiffNode.getAllChildren returns an unmodifiable collection since 0a4b318. Therefore, this will result in an UnsupportedOperationException.
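The failure mode and one possible fix can be reproduced with plain JDK collections. The sketch below is not DiffDetective code: `intersection` and `retainAllThrows` are hypothetical helpers, and an unmodifiable JDK list stands in for whatever `getAllChildren` returns.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class RetainAllSketch {
    /** Hypothetical helper: intersects two collections without mutating either one. */
    static <T> List<T> intersection(Collection<? extends T> a, Collection<?> b) {
        List<T> copy = new ArrayList<>(a); // mutable defensive copy
        copy.retainAll(b);                 // safe: only the copy is modified
        return copy;
    }

    /** Returns true if calling retainAll directly on the given collection throws. */
    static boolean retainAllThrows(Collection<String> view, Collection<String> other) {
        try {
            view.retainAll(other);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```

Copying before intersecting keeps the unmodifiable return type of the getter intact, at the cost of one extra allocation per call.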

Brackets in variable names upon boolean abstraction

Upon boolean abstraction, names with brackets might be generated. For example, #if defined(A) && (B * 2) > C is abstracted to A && (B__MUL__2)__GT__C, and the brackets become part of a variable's name. Unfortunately, this confuses some parsers and is problematic for some use cases. The brackets should be avoided if possible.

Consistent code formatting

Which code formatter should we use, without forcing developers into using a specific IDE?

@pmbittner is using IntelliJ with its default settings, right?
@AlexanderSchultheiss and I are using jdtls with its default settings.
Some code is committed without going through a code formatter.

Both are quite configurable. It seems like the settings of IntelliJ can be exported as a jdtls settings file called an Eclipse profile. @pmbittner could you export your formatting settings? Then I will check how differently IntelliJ's and jdtls' formatters handle these settings. If there are no differences, we can just commit both settings files into the repo and all current developers should be happy.
If they differ too much we should decide on one formatter which can be integrated into the different IDEs.

What do you think about a precommit hook or CI for code formatting to ensure the code is/stays consistent?

Crash Safety

DiffDetective runs for a few days when mining some repositories. It may crash. It would be great to continue work where it stopped last.

Find a way to realize such crash safety. Do we have to store all commits that were already processed? Is it sufficient to just remember the last successfully processed commit? In particular, does git.log.call() return an iterator with a total order so we can reliably find the commit to return to?
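If remembering only the last successfully processed commit turns out to be sufficient, resuming could look roughly like the sketch below. It assumes that the commit iteration order is deterministic across runs, which is exactly the open question above. `CheckpointSketch` and `remaining` are hypothetical names, and commit IDs are modeled as plain strings.

```java
import java.util.List;

class CheckpointSketch {
    /**
     * Returns the commits that still have to be processed.
     * Assumption: orderedCommitIds has the same order in every run.
     */
    static List<String> remaining(List<String> orderedCommitIds, String lastProcessed) {
        if (lastProcessed == null) {
            return orderedCommitIds; // no checkpoint yet: process everything
        }
        int index = orderedCommitIds.indexOf(lastProcessed);
        if (index < 0) {
            throw new IllegalArgumentException("unknown checkpoint: " + lastProcessed);
        }
        return orderedCommitIds.subList(index + 1, orderedCommitIds.size());
    }
}
```

If the order is not stable, the alternative raised above (storing all processed commits) would be needed instead.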

Pattern Mining Postprocessing

Implement the postprocessing phase for semantic pattern mining. It should consist of:

CutNonEditedSubtrees, as non-edited subtrees may occur if a non-edited subtree occurred frequently and thus is part of a pattern
Filter ill-formed patterns that are either semantically or syntactically invalid.
Filter duplicates: This requires to solve graph isomorphism. Is there a good library we can use? If we have to implement it on our own, can we abuse the structure of DiffTrees for that?
Filter patterns containing less than two atomic patterns as these patterns cannot be semantic. We may want to count how many patterns we filtered here.

We may want to count how many patterns we filtered in each step.

Datasets for Replication Package

Create a new GitHub user that forks all 44 datasets to freeze them. Preferably, do this in an automated way.

DiffGraph

To analyse edit patterns and to allow more flexible export of DiffTrees to linegraph, we need to relax our notion of tree. In particular, we should not force that our DiffTrees have one particular root as it might not appear in some patterns we want to analyse or subgraphs we want to export.

The task of this issue is to find a way to express this reasonably. We could add a new DiffGraph class, but it would be cumbersome to reimplement or copy most of the DiffTree implementation for that. We should check how many functions (e.g., the DiffTreeTransformers) rely on the tree being a well-formed tree.

Feature Identification

Currently, we just interpret every conditional macro we find as a feature annotation. This is not always true though. For example, include guards do not identify any variability in the software. We thus need a way to check if an annotation is indeed a feature annotation.

In Linux, for example, any feature starts with CONFIG_. We could thus check if the string of a conditional macro contains this prefix somewhere and, if so, interpret the macro as a feature annotation, or as plain code otherwise.
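A minimal sketch of such a check is shown below. The class and method names are hypothetical and not part of DiffDetective. The sketch also uses a word boundary so that identifiers such as MY_CONFIG_X do not count, which is a slightly stricter reading of "contains" than the text above.

```java
import java.util.regex.Pattern;

class FeatureAnnotationSketch {
    // Assumption: a Linux feature is an identifier that starts with CONFIG_.
    private static final Pattern LINUX_FEATURE = Pattern.compile("\\bCONFIG_\\w+");

    /** Returns true if the macro's condition mentions at least one CONFIG_ identifier. */
    static boolean isFeatureAnnotation(String macroLine) {
        return LINUX_FEATURE.matcher(macroLine).find();
    }
}
```

An include guard such as `#ifndef _LINUX_KERNEL_H` is then treated as plain code, whereas `#ifdef CONFIG_SMP` is treated as a feature annotation. Other projects would need their own predicate.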

Parsing Diff to Working Tree

Currently, DiffDetective can only diff two commits. This is done in the GitDiffer class via GitDiffer::createCommitDiff. The GitDiffer should be refactored and extended to also enable diffing any commit with the current working tree. The CommitDiff class might have to be generalized, too, to just a Diff class between.

Parallelization of DiffTree Mining

Mining Linux data takes 14 days. We should parallelize construction and export of DiffTrees.

Extend DiffTreeMiner to Read Sets of Input Repositories

Currently, the DiffTreeMiner runs on exactly one dataset. For larger analyses we have to be able to mine DiffTrees from more than one repository simultaneously.

Remove deprecated Mining package

The org.variantsync.diffdetective.mining package is from the early days of DiffDetective, is no longer in use, and likely never will be again. Currently, it is more of a distraction and maintenance burden. Hence, it should be deleted. Some classes might still be in use though or generally useful (e.g., the formats). These classes should be moved to other, more suitable packages. There are also some scripts in the scripts directory as well as a Python code base in the mining directory, both of which are not in use anymore and only work together with the deprecated org.variantsync.diffdetective.mining package. So these could be deleted too. In any case, everything we delete here survives in the git history in case we ever need it again.

Builder for RenderOptions

Creating RenderOptions is quite cumbersome because one has to specify many values. Most of the time, we are interested in just setting an explicit value for a few options while setting the other values to their default. It thus would be good to use the builder pattern to build RenderOptions incrementally.

Tasks for this issue:

Extract the diff.difftree.render.DiffTreeRenderer$RenderOptions to their own class.
Create a builder class within the RenderOptions class. You may take inspiration from the builder for DiffFilters. Instead of storing a copy of all fields in the builder, you may store an instance of RenderOptions that you can modify (use RenderOptions.DEFAULT as the starting value). This avoids code duplication. Here is also a cool blog article on the builder pattern (we don't need the director though and we only need a single builder type).
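The second task could be sketched as follows. The fields `dpi`, `nodeSize`, and `withLabels` and their default values are purely illustrative assumptions; the real RenderOptions has different and more options. The builder holds a single instance, starting from DEFAULT, and replaces it on every setter call so that DEFAULT itself is never mutated.

```java
class RenderOptionsSketch {
    // Hypothetical options, for illustration only.
    final int dpi;
    final int nodeSize;
    final boolean withLabels;

    static final RenderOptionsSketch DEFAULT = new RenderOptionsSketch(300, 700, true);

    private RenderOptionsSketch(int dpi, int nodeSize, boolean withLabels) {
        this.dpi = dpi;
        this.nodeSize = nodeSize;
        this.withLabels = withLabels;
    }

    static final class Builder {
        // Stores one instance instead of a copy of every field.
        private RenderOptionsSketch current = DEFAULT;

        Builder setDpi(int dpi) {
            current = new RenderOptionsSketch(dpi, current.nodeSize, current.withLabels);
            return this;
        }

        Builder setNodeSize(int nodeSize) {
            current = new RenderOptionsSketch(current.dpi, nodeSize, current.withLabels);
            return this;
        }

        Builder setWithLabels(boolean withLabels) {
            current = new RenderOptionsSketch(current.dpi, current.nodeSize, withLabels);
            return this;
        }

        RenderOptionsSketch build() {
            return current;
        }
    }
}
```

A caller that only cares about one option would write `new RenderOptionsSketch.Builder().setDpi(150).build()` and get the defaults for everything else.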

Implement a proper unparse method for `DiffTree`s

The old DiffTree.toTextDiff has been removed in #67 because it wasn't able to serialize DiffTrees with moved subtrees. We need a replacement that is able to cope with all DiffTrees properly.

In contrast to line graphs the resulting format should be human readable (e.g. for use in debug messages) and ideally injective to be used in test assertions.

UnsupportedClassVersion

Unfortunately, the functjonal jar file in the local Maven repo seems to have been compiled with Java 19(?) or newer. This makes compilation fail with lower Java versions:

[ERROR]   TreeDiffingTest.lambda$testCases$0:68->testCase:80->parseVariationTree:118 » UnsupportedClassVersion org/variantsync/functjonal/Cast has been compiled by a more recent version of the Java Runtime (class file version 63.0), this version of the Java Runtime only recognizes class file versions up to 62.0

I got this error when trying to do mvn install with Java 18.

The functjonal library should be recompiled with the current minimal Java version (16). It would be great if we could find a way to prevent such errors in the future (e.g., by adding a check to the script for updating the local maven repository).
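One conceivable building block for such a check is to read the class file version directly from a compiled class. A class file starts with the magic number 0xCAFEBABE, followed by the minor and the major version as two unsigned 16-bit values. For Java 5 and later, major version N corresponds to Java release N - 44, so 63 is Java 19 and 60 is Java 16. The class and method names below are hypothetical, and how the bytes are obtained from the jar is left out.

```java
import java.nio.ByteBuffer;

class ClassVersionSketch {
    /** Reads the major class file version from the first eight bytes of a .class file. */
    static int majorVersion(byte[] classFileBytes) {
        ByteBuffer buffer = ByteBuffer.wrap(classFileBytes); // big-endian by default
        if (buffer.remaining() < 8 || buffer.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buffer.getShort();                // skip the minor version
        return buffer.getShort() & 0xFFFF; // major version as unsigned value
    }

    /** Maps a major version to the Java release (valid for Java 5 and later). */
    static int javaRelease(int majorVersion) {
        return majorVersion - 44;
    }

    /** True if the class can be loaded by the given Java release. */
    static boolean isCompatible(byte[] classFileBytes, int maxRelease) {
        return javaRelease(majorVersion(classFileBytes)) <= maxRelease;
    }
}
```

A script that updates the local Maven repository could run such a check over the classes in the jar and refuse to continue if any of them exceeds the minimal supported release.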

Move SimpleMetadata to Functjonal

The SimpleMetadata class is not specific to DiffDetective and could be moved into the Functjonal library. After this refactoring, it is important to refresh the local-maven-repository (see this script).

Extract a Dataset Class and Test DiffDetective on Other Datasets

Currently, we have only used a specific old Marlin sub-history as input for DiffDetective (repositories/Marlin_old.zip).

Add support for expressing different datasets. Currently, the setup for datasets is hardcoded in Main and DiffTreeMiner. It would be great to have something like a Repository class with pre-defined instances, for Linux, Marlin, Busybox, ...
Test the DiffTreeMiner with Linux as input. Do we find bugs? How many commits can we handle? Do we have to filter commits or files specifically? Is there a suitable sub-history we should inspect? It is unlikely that we will be able to process the entire history of Linux because it is monstrously big.

Rename `AnalysisStrategy`

As discussed in #53 the name AnalysisStrategy is too generic.
It was suggested to call it VariationDiffExportStrategy instead.

Formula extraction: Nested parentheses are not resolved correctly

Description

I found instances of #if macros that use nested parentheses. Such instances are currently not always extracted correctly, because the regex removes the leading and trailing parentheses.

Example

The formula #if(STDC == 1) && (defined(LARGE) || defined(COMPACT)) is extracted as STDC__EQ__1)&&(DEFINED_LARGE||DEFINED_COMPACT.

The expected correct result is __LB__STDC__EQ__1__RB__&&(DEFINED_LARGE||DEFINED_COMPACT). Note: Parentheses belonging to boolean expressions are abstracted as __LB__ and __RB__.

Diff Parsing: Multi-line macros with inline comments are not parsed completely

Description

Macros can be defined in multiple lines by using line continuations with \. Additionally, macros can also become multi-line macros, if they contain C-Style comments that span multiple lines. The latter case is currently not handled by the parser, which results in broken formulas being extracted.

Example

The following multi-line #if macro is possible, but not parsed completely:

 # if A && \
 /* inline
 comment
 with multiple
 lines */ \
  B \
  && D
+   baz();
-   vaz();
 #endif

Expected

The expected interpretation of the diff is:

 # if A && /* inline comment with multiple lines */   B   && D
+   baz();
-   vaz();
 #endif

Actual (Issue)

However, the actual interpretation by the parser is:

 # if A && /* inline 
comment 
with multiple 
lines */   B   && D
+   baz();
-   vaz();
 #endif

Proposed fix

The easiest solution is to track whether the current line is part of a C-style comment, and to consider the line incomplete if it is.
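The proposed fix could be sketched as below. This is not the parser's actual code: the names are hypothetical, diff markers are assumed to be stripped already, and the sketch deliberately ignores `//` comments, string literals, and whitespace after a trailing backslash. A line break inside a comment is replaced by a single space.

```java
import java.util.List;

class LogicalLineSketch {
    /** Returns whether we are inside a C-style comment after scanning the given line. */
    static boolean endsInsideComment(String line, boolean insideComment) {
        int i = 0;
        while (i < line.length()) {
            // Look for the delimiter that would toggle the current state.
            int next = line.indexOf(insideComment ? "*/" : "/*", i);
            if (next < 0) {
                return insideComment;
            }
            insideComment = !insideComment;
            i = next + 2;
        }
        return insideComment;
    }

    /** Joins the physical lines starting at index from into one logical line. */
    static String joinLogicalLine(List<String> lines, int from) {
        StringBuilder result = new StringBuilder();
        boolean insideComment = false;
        for (int i = from; i < lines.size(); i++) {
            String line = lines.get(i);
            insideComment = endsInsideComment(line, insideComment);
            boolean continued = line.endsWith("\\");
            result.append(continued ? line.substring(0, line.length() - 1) : line);
            if (!continued && !insideComment) {
                break; // the logical line is complete
            }
            if (!continued) {
                result.append(' '); // line break inside a comment
            }
        }
        return result.toString();
    }
}
```

Applied to the example above, the lines from `# if A && \` up to `&& D` are joined into a single line, and the following `baz();` line is not consumed.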

Delete Deprecated `preliminary` Package

The preliminary package is deprecated and in no active use. It causes harm and extra effort upon refactorings in the code base without benefit. Hence, it should be deleted. There should be no dependencies on this package but in case there are dependencies, these should be resolved.

Simpler Dataset Files

Currently, datasets are given as markdown files with lots of unused columns:

| Project name | Domain | Source code available (yes/no)? | Is it a git repository (yes/no)? | Repository URL | Clone URL | Estimated number of commits |
| --- | --- | --- | --- | --- | --- | --- |
| apache-httpd | web server | y | y | https://github.com/apache/httpd | https://github.com/DiffDetective/httpd.git | 32,927 |
| berkeley-db-libdb | database system | y | y | https://github.com/berkeleydb/libdb | https://github.com/DiffDetective/libdb.git | 7 |

Our dataset loader in fact only uses the project name and clone URL. Hence, dataset files and the loading should be simplified. The columns for Domain, and Repository URL are interesting but not essential. So maybe these could stay in the files but be the last two columns.

Also, except for line 2 of the file, markdown files with just a single table like this are actually CSV files with | as the separator instead of , or ;. So maybe we could reuse our CSV IO classes here.
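To illustrate how little is needed to treat such a file as |-separated CSV, here is a sketch with hypothetical names. It does not use DiffDetective's CSV IO classes, whose API is not shown here, and it assumes that cells never contain a | themselves.

```java
import java.util.ArrayList;
import java.util.List;

class DatasetRowSketch {
    /** Splits one row of a markdown table into trimmed cells. */
    static List<String> cells(String row) {
        String trimmed = row.trim();
        // Leading and trailing pipes are optional in markdown tables.
        if (trimmed.startsWith("|")) {
            trimmed = trimmed.substring(1);
        }
        if (trimmed.endsWith("|")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1);
        }
        List<String> result = new ArrayList<>();
        for (String cell : trimmed.split("\\|", -1)) {
            result.add(cell.trim());
        }
        return result;
    }

    /** True for the markdown separator row (line 2 of the table), e.g. "---|---". */
    static boolean isSeparatorRow(String row) {
        return row.contains("-") && row.trim().matches("[\\s|:-]+");
    }
}
```

A loader would skip the header and the separator row and then read only the project name and clone URL cells of each remaining row.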

Formula extraction: Multi-line C-style comments are not filtered

Description

The CPPDiffLineFormulaExtractor filters comments in lines containing C preprocessor statements. Currently, only comments are filtered that are fully included in the line. Multi-line comments are not filtered.

This is a bug because it leads to invalid formulas.

Example

The extractor determines the invalid formula A&&BMulti-lineComment for the code snippet

#if A && B /* Multi-line
                Comment
                */

The expected result is A&&B.

Fix

Multi-line comments need to be filtered after inline comments have been filtered.
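As an alternative sketch, both inline and multi-line comments can be removed in a single pass once all lines of the directive are available as one string. This is an assumption about how the input is provided and not how CPPDiffLineFormulaExtractor currently works; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CommentFilterSketch {
    // DOTALL lets '.' match line breaks; the reluctant '*?' stops at the first "*/".
    private static final Pattern C_COMMENT = Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL);

    /** Removes all terminated C-style comments, including those spanning several lines. */
    static String stripComments(String text) {
        return C_COMMENT.matcher(text).replaceAll("");
    }
}
```

For the snippet above, stripping the comment leaves `#if A && B` plus trailing whitespace, from which the expected formula A&&B can be extracted. Unterminated comments are left untouched by this sketch.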

Specification of Commit Ranges to Consider

For Linux it might be necessary that we do not analyze the entire commit history (i.e., by using git.log().call() in GitDiffer) but only a window (or subrange) of commits. Thus, we want to be able to specify a startCommit and endCommit from a certain branch for a dataset. The GitDiffer should then only create CommitDiffs within the specified range. If the range is empty, the entire history should be considered (as before).

Complete Consistency Check for DiffTrees

As described in the paper, DiffTrees have the following well-formedness criteria:

Each node may contain at most one parent before the edit and at most one parent after the edit.

Part of this is automatically ensured as each DiffNode has exactly one field for an after parent and a before parent, respectively. Though, a DiffTree might get corrupted in the sense that different DiffNodes may have the same other DiffNode listed as their before child (or after child) simultaneously. This is implemented in DiffNode::assertConsistency().

Each node has at least one parent, except for one dedicated root node (that may represent an entire file for example).

Seems to be missing!

There are no cycles.

Is implemented via DiffTree::HasPathToRootCached.

Nodes that have exactly one parent were edited.

Edited nodes have either diff type ADD or REM. If a node has a parent before the edit but not after the edit, it has to be a removed node. Analogously, if a node has a parent after the edit but not before the edit, it has to be an inserted node. This check is also not implemented yet.
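The missing check for this last criterion could look roughly like the following sketch. It is written against a hypothetical `DiffType` enum and plain boolean flags instead of the real DiffNode API, so the names and signatures are assumptions.

```java
class EditConsistencySketch {
    // Hypothetical stand-in for the diff type of a node; NON means unchanged.
    enum DiffType { ADD, REM, NON }

    /**
     * Returns the diff type required by the criterion above,
     * or null if the criterion does not constrain the node.
     */
    static DiffType expectedType(boolean hasBeforeParent, boolean hasAfterParent) {
        if (hasBeforeParent && !hasAfterParent) {
            return DiffType.REM; // only existed before the edit
        }
        if (!hasBeforeParent && hasAfterParent) {
            return DiffType.ADD; // only exists after the edit
        }
        return null; // zero or two parents: not covered by this criterion
    }

    /** True if the node's actual diff type agrees with its parents. */
    static boolean isConsistent(DiffType actual, boolean hasBeforeParent, boolean hasAfterParent) {
        DiffType expected = expectedType(hasBeforeParent, hasAfterParent);
        return expected == null || expected == actual;
    }
}
```

Such a predicate could be evaluated for every node during the same traversal that performs the other consistency checks.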

Check Javadoc syntax of private members in CI

We do not generate Javadoc for private members (should we?), so Javadoc doesn't check the validity of their documentation (syntax and references). There are already some errors, which can be seen by generating the Javadoc with this patch:

--- a/pom.xml
+++ b/pom.xml
@@ -24,6 +24,7 @@
                 <configuration>
                     <reportOutputDirectory>docs</reportOutputDirectory>
                     <destDir>javadoc</destDir>
+                    <show>private</show>
                     <quiet>true</quiet>
                 </configuration>
             </plugin>

and running mvn javadoc:javadoc.

We should automate this check in CI.

Linegraph importer

Currently, we can export CommitDiffs, PatchDiffs, and DiffTrees to linegraph. To validate mined edit patterns, we also need the ability to import linegraph files as DiffTrees. It would be great to have one class being responsible for both reading and writing.

Pattern Collapsing

We have to be able to collapse detected patterns on the DiffTrees. In particular, we want to detect atomic patterns on the tree and then collapse all matches to single nodes of type PATTERN_MATCH or something like that. First, see if and how such a collapse could be realised given the possible overlaps, e.g., when having two AddToPC matches for two code nodes below the same IF. Second, add support for nodes representing such a collapsed pattern.

Export Metadata

Export all metadata we need for validation of DiffTree construction and atomic pattern matching (e.g., how often each atomic pattern was matched) along with the metadata we already capture.

Use DiffDetective Logo in GUI

When opening the GUI of DiffDetective, the frame will show a Java icon at the top right corner. We should put our little DiffDetective in there! ;)
