Comments (21)
[@mjherzog comment] In most move cases, I would expect many lower-level path elements to be the same so we are really looking at path similarities from file name back up the tree.
from deltacode.
[@MaJuRG comment] If a file has the same filename and sha1 but not same path, it should be considered moved. we will have to consider collisions in this case though.
from deltacode.
Moved files may be in either the Add or Removed buckets, so this would be a good place to start when thinking about a possible fix.
from deltacode.
@MaJuRG I’ve deleted the modified code in DeltaCode.determine_delta()
and created a new method, DeltaCode.determine_moved()
, which uses a dictionary of removed
files, keyed by their sha1
attribute, to identify and move matching files from the added
and removed
categories to a new moved
category.
The DeltaCode
constructor and the DeltaCode.get_stats()
and DeltaCode.to_dict()
methods now include the moved
category, as does Delta.to_dict()
. I’ve modified the latter to display attributes for the new_file
/old_file
to enable the user to distinguish which moved
files would otherwise have been categorized as added
or removed
, but the resulting JSON display is a bit verbose. Here’s a sample excerpt from the JSON output for a file that would otherwise have been categorized as added
:
{
"deltacode_version": "0.0.1.beta",
"deltacode_stats": {
"added": 0,
"modified": 0,
"moved": 3,
"removed": 0,
"unmodified": 7
},
"deltas": [
{
"category": "moved",
"new_sha1": "6f71666c46446c29d3f45feef5419ae76fb86a5b",
"old_sha1": "",
"new_path": "b/a4.py",
"old_path": "",
"new_name": "a4.py",
"old_name": "",
"new_type": "file",
"old_type": "",
"new_size": 200,
"old_size": ""
},
. . .
Perhaps a better approach would be a single new attribute denoting added/removed or new/old. I haven’t yet modified the CSV-generation code but will once we’ve settled on the column headings and data to display for the moved
files.
In addition to fixing a group of failing tests, I’ve added three new tests. I also plan to add some CSV tests once we’ve settled on what to display for the moved
files, and I’d appreciate your suggestions on additional tests I could add to thoroughly vet the determine_moved()
method.
I’m available for an uberconf whenever it’s convenient for you.
from deltacode.
Ok, let me play around with this on my end.
from deltacode.
@johnmhoran After playing around with this, one of the first things I noticed is that we should have a single 'moved' object for an added/removed pair.
It looks like we have two separate moved objects being created at this time. For how we do output, etc this is not ideal. our 'moved' delta object should have both the new_
and old_
attributes filled out.
Hopefully that makes sense.
from deltacode.
@MaJuRG That was my original goal, but after working with my 3 test codebases I realized that pairing an added
and removed
is not always straightforward. For example, suppose a.py
is removed from the a
directory in old
codebase and added to the b
and the c
directories in the new
codebase. Do we pair old
a/a.py
with new
b/a.py
or new
c/a.py
? And what do we do with the second added copy of a.py
?
It was in light of this potential complexity that I chose to treat each added
and removed
file as a separate moved
file, thinking that we could then display them to the user as potential pairs by their common sha1
, name
and size
attributes. How would you suggest handling this scenario?
from deltacode.
We should probably just focus on the easiest cases for now, and pair/create moved objects for those.
This means, after we index, we should only look at places where len(added_index[sha1]) == 1
and len(removed_index[sha1]) == 1
. Then we know that this is the only place on both sides that this file can exist, so it is a singular move.
Once we can reliable calculate moves for the most simplest of cases, we can think about how to handle these other scenarios.
from deltacode.
In other words, because we are indexing by sha1
, there is the potential to have an arbitrary number of files that have the same sha1
value. Hence why it is not straightforward to handle moves, as you mentioned above.
So if we only look at the places where sha1
index corresponds to a single file, then we dont have to worry.
Yes, this will not catch all the possible moves, but it will get a good number of them.
from deltacode.
I'll refactor accordingly.
from deltacode.
@MaJuRG What data do you want to display in the CSV output for the moved
deltas? Current CSV columns are Type of delta
, Path
, Name
, Type
and Size
.
Here's a sample excerpt from the current JSON output for moved
and unmodified
Delta
objects:
{
"deltacode_version": "0.0.1.beta",
"deltacode_stats": {
"added": 0,
"modified": 0,
"moved": 1,
"removed": 0,
"unmodified": 7
},
"deltas": [
{
"category": "moved",
"new_sha1": "6f71666c46446c29d3f45feef5419ae76fb86a5b",
"old_sha1": "6f71666c46446c29d3f45feef5419ae76fb86a5b",
"new_path": "b/a4.py",
"old_path": "a/a4.py",
"new_name": "a4.py",
"old_name": "a4.py",
"new_type": "file",
"old_type": "file",
"new_size": 200,
"old_size": 200
},
{
"category": "unmodified",
"path": "a/a3.py",
"name": "a3.py",
"type": "file",
"size": 200
},
from deltacode.
We can treat new_path
as the 'Path' attribute for the output. All the other fields (name type size) can be taken from the new
side.
Honestly, all we really need is the old_path
attribute in our output for moved files i.e.:
deltas: [
{
'category': 'moved',
'path': 'some/new/path/',
'old_path': 'some/old/path',
'name': 'name',
'type': 'type',
'size': 200,
}
]
just to keep it consistent for now.
from deltacode.
And corresponding columns for the CSV output?
from deltacode.
You can add it to the end for now, so it doesn't clutter up the majority of the results.
from deltacode.
OK.
from deltacode.
@MaJuRG FYI, working on the dummy Delta
objects and tests for the revised check_moved()
method, haven't yet figured out how to structure.
Have tried several approaches. For example, the following approach throws an error -- TypeError: unbound method check_moved() must be called with DeltaCode instance as first argument (got str instance instead)
:
def test_DeltaCode_check_moved_no_sha1_match(self):
added_sha1 = '6f71666c46446c29d3f45feef5419ae76fb86a5b'
added_deltas = [
OrderedDict([('category', 'added'), ('path', u'b/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
]
removed_sha1 = '1234566c46446c29d3f45feef5419ae76fb86a5b'
removed_deltas = [
OrderedDict([('category', 'removed'), ('path', u'a/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
]
result = deltacode.DeltaCode.check_moved(added_sha1, added_deltas, removed_sha1, removed_deltas)
print('\nresult = {}\n'.format(result))
I think I'm stumbling on how to handle the self
variable in the check_moved()
method.
from deltacode.
@MaJuRG I've moved check_moved()
from a DeltaCode
method to a utility in utils.py but I'm still encountering errors with the structure of my dummy Delta
objects. Current error:
. . .
> if added_deltas[0].new_file.name == removed_deltas[0].old_file.name:
E AttributeError: 'dict' object has no attribute 'new_file'
Current test structure (similar errors with the alternative constructions of added_deltas
and removed_deltas
):
def test_DeltaCode_check_moved_1_sha1_match(self):
added_sha1 = '6f71666c46446c29d3f45feef5419ae76fb86a5b'
added_deltas = [
# OrderedDict([('category', 'added'), ('path', u'b/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
{
'old_file': {'sha1': '', 'name': '', 'original_path': '', 'licenses': [], 'path': '', 'type': '', 'size': ''},
'category': 'removed',
'new_file': {'sha1': u'6f71666c46446c29d3f45feef5419ae76fb86a5b', 'name': u'a4.py', 'original_path': u'1_file_moved_old/a/a4.py', 'path': u'a/a4.py', 'type': u'file', 'size': 200}
}
]
removed_sha1 = '6f71666c46446c29d3f45feef5419ae76fb86a5b'
removed_deltas = [
# OrderedDict([('category', 'removed'), ('path', u'a/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
{
'old_file': {'sha1': u'6f71666c46446c29d3f45feef5419ae76fb86a5b', 'name': u'a4.py', 'original_path': u'1_file_moved_old/a/a4.py', 'path': u'a/a4.py', 'type': u'file', 'size': 200},
'category': 'removed',
'new_file': {'sha1': '', 'name': '', 'original_path': '', 'licenses': [], 'path': '', 'type': '', 'size': ''}
}
]
result = utils.check_moved(added_sha1, added_deltas, removed_sha1, removed_deltas)
print('\nresult = {}\n'.format(result))
from deltacode.
@johnmhoran You are using dicts instead of Delta objects. You have to create actual Delta objects in order to access .new_file
or old_file
from deltacode.
This test for example creates a single delta object:
def test_Delta_create_object_modified(self):
new = models.File({'path': 'path/modified.txt', 'sha1': 'a'})
old = models.File({'path': 'path/modified.txt', 'sha1': 'b'})
delta = deltacode.Delta(new, old, 'modified')
assert delta.new_file.path == 'path/modified.txt'
assert delta.new_file.sha1 == 'a'
assert delta.old_file.path == 'path/modified.txt'
assert delta.old_file.sha1 == 'b'
assert delta.category == 'modified'
from deltacode.
Thanks @MaJuRG -- I thought I needed to create a "hand-made" Delta
object.
from deltacode.
The basics of this have been implemented with the merge of #42. Closing.
from deltacode.
Related Issues (20)
- Add limited delta stats after running HOT 1
- Upgrade DeltaCode to Python 3 HOT 4
- Create DeltaCode documentation on ReadTheDocs HOT 6
- Linux and MacOs buid is showing some warnings in TravisCI HOT 3
- configure failed,why? HOT 10
- Adding Azure Pipelines HOT 1
- In the output content format ‘[ ’how to understand? HOT 2
- Configure failed with "file setup.py not found" HOT 11
- Azure Piplines seems to be filing for Windows Test Jobs
- Create objects to score scan information HOT 6
- Add function to handle loading 2 codebases. HOT 2
- Remove redundant Scan model HOT 1
- Remove redundant File model
- Add Dockerfile
- Separate csv formatted output in Deltacode
- Merge DeltaCode in ScanCode TK
- Update structure to use the https://github.com/nexB/skeleton
- RFC: DeltaCode next! and roadmap HOT 2
- Update documentation after deltacode gets merge in scancode-toolkit HOT 2
- License detection diffs are incorrect HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from deltacode.