This will need some thinking, but we will want some way to tell if a file has been 'mo

[<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/u

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Handle 'moved' files,about nexb/deltacode

Comments (21)

johnmhoran commented on July 23, 2024

[@mjherzog comment] In most move cases, I would expect many lower-level path elements to be the same so we are really looking at path similarities from file name back up the tree.

from deltacode.

johnmhoran commented on July 23, 2024

[@MaJuRG comment] If a file has the same filename and sha1 but not same path, it should be considered moved. we will have to consider collisions in this case though.

from deltacode.

steven-esser commented on July 23, 2024

Moved files may be in either the Add or Removed buckets, so this would be a good place to start when thinking about a possible fix.

from deltacode.

johnmhoran commented on July 23, 2024

@MaJuRG I’ve deleted the modified code in DeltaCode.determine_delta() and created a new method, DeltaCode.determine_moved(), which uses a dictionary of removed files, keyed by their sha1 attribute, to identify and move matching files from the added and removed categories to a new moved category.

The DeltaCode constructor and the DeltaCode.get_stats() and DeltaCode.to_dict() methods now include the moved category, as does Delta.to_dict(). I’ve modified the latter to display attributes for the new_file/old_file to enable the user to distinguish which moved files would otherwise have been categorized as added or removed, but the resulting JSON display is a bit verbose. Here’s a sample excerpt from the JSON output for a file that would otherwise have been categorized as added:

{
  "deltacode_version": "0.0.1.beta",
  "deltacode_stats": {
    "added": 0,
    "modified": 0,
    "moved": 3,
    "removed": 0,
    "unmodified": 7
  },
  "deltas": [
    {
      "category": "moved",
      "new_sha1": "6f71666c46446c29d3f45feef5419ae76fb86a5b",
      "old_sha1": "",
      "new_path": "b/a4.py",
      "old_path": "",
      "new_name": "a4.py",
      "old_name": "",
      "new_type": "file",
      "old_type": "",
      "new_size": 200,
      "old_size": ""
    },
. . .

Perhaps a better approach would be a single new attribute denoting added/removed or new/old. I haven’t yet modified the CSV-generation code but will once we’ve settled on the column headings and data to display for the moved files.

In addition to fixing a group of failing tests, I’ve added three new tests. I also plan to add some CSV tests once we’ve settled on what to display for the moved files, and I’d appreciate your suggestions on additional tests I could add to thoroughly vet the determine_moved() method.

I’m available for an uberconf whenever it’s convenient for you.

from deltacode.

steven-esser commented on July 23, 2024

Ok, let me play around with this on my end.

from deltacode.

steven-esser commented on July 23, 2024

@johnmhoran After playing around with this, one of the first things I noticed is that we should have a single 'moved' object for an added/removed pair.

It looks like we have two separate moved objects being created at this time. For how we do output, etc this is not ideal. our 'moved' delta object should have both the new_ and old_ attributes filled out.

Hopefully that makes sense.

from deltacode.

johnmhoran commented on July 23, 2024

@MaJuRG That was my original goal, but after working with my 3 test codebases I realized that pairing an added and removed is not always straightforward. For example, suppose a.py is removed from the a directory in old codebase and added to the b and the c directories in the new codebase. Do we pair old a/a.py with new b/a.py or new c/a.py? And what do we do with the second added copy of a.py?

It was in light of this potential complexity that I chose to treat each added and removed file as a separate moved file, thinking that we could then display them to the user as potential pairs by their common sha1, name and size attributes. How would you suggest handling this scenario?

from deltacode.

steven-esser commented on July 23, 2024

We should probably just focus on the easiest cases for now, and pair/create moved objects for those.

This means, after we index, we should only look at places where len(added_index[sha1]) == 1 and len(removed_index[sha1]) == 1. Then we know that this is the only place on both sides that this file can exist, so it is a singular move.

Once we can reliable calculate moves for the most simplest of cases, we can think about how to handle these other scenarios.

from deltacode.

steven-esser commented on July 23, 2024

In other words, because we are indexing by sha1, there is the potential to have an arbitrary number of files that have the same sha1 value. Hence why it is not straightforward to handle moves, as you mentioned above.

So if we only look at the places where sha1 index corresponds to a single file, then we dont have to worry.

Yes, this will not catch all the possible moves, but it will get a good number of them.

from deltacode.

johnmhoran commented on July 23, 2024

I'll refactor accordingly.

from deltacode.

johnmhoran commented on July 23, 2024

@MaJuRG What data do you want to display in the CSV output for the moved deltas? Current CSV columns are Type of delta, Path, Name, Type and Size.

Here's a sample excerpt from the current JSON output for moved and unmodified Delta objects:

{
  "deltacode_version": "0.0.1.beta",
  "deltacode_stats": {
    "added": 0,
    "modified": 0,
    "moved": 1,
    "removed": 0,
    "unmodified": 7
  },
  "deltas": [
    {
      "category": "moved",
      "new_sha1": "6f71666c46446c29d3f45feef5419ae76fb86a5b",
      "old_sha1": "6f71666c46446c29d3f45feef5419ae76fb86a5b",
      "new_path": "b/a4.py",
      "old_path": "a/a4.py",
      "new_name": "a4.py",
      "old_name": "a4.py",
      "new_type": "file",
      "old_type": "file",
      "new_size": 200,
      "old_size": 200
    },
    {
      "category": "unmodified",
      "path": "a/a3.py",
      "name": "a3.py",
      "type": "file",
      "size": 200
    },

from deltacode.

steven-esser commented on July 23, 2024

We can treat new_path as the 'Path' attribute for the output. All the other fields (name type size) can be taken from the new side.

Honestly, all we really need is the old_path attribute in our output for moved files i.e.:

deltas: [
  {
    'category': 'moved',
    'path': 'some/new/path/',
    'old_path': 'some/old/path',
    'name': 'name',
    'type': 'type',
    'size': 200,
  }
]

just to keep it consistent for now.

from deltacode.

johnmhoran commented on July 23, 2024

And corresponding columns for the CSV output?

from deltacode.

steven-esser commented on July 23, 2024

You can add it to the end for now, so it doesn't clutter up the majority of the results.

from deltacode.

johnmhoran commented on July 23, 2024

OK.

from deltacode.

johnmhoran commented on July 23, 2024

@MaJuRG FYI, working on the dummy Delta objects and tests for the revised check_moved() method, haven't yet figured out how to structure.

Have tried several approaches. For example, the following approach throws an error -- TypeError: unbound method check_moved() must be called with DeltaCode instance as first argument (got str instance instead):

    def test_DeltaCode_check_moved_no_sha1_match(self):
        added_sha1 = '6f71666c46446c29d3f45feef5419ae76fb86a5b'
        added_deltas = [
                OrderedDict([('category', 'added'), ('path', u'b/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
            ]
        removed_sha1 = '1234566c46446c29d3f45feef5419ae76fb86a5b'
        removed_deltas =  [
                OrderedDict([('category', 'removed'), ('path', u'a/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
            ]

        result = deltacode.DeltaCode.check_moved(added_sha1, added_deltas, removed_sha1, removed_deltas)

        print('\nresult = {}\n'.format(result))

I think I'm stumbling on how to handle the self variable in the check_moved() method.

from deltacode.

johnmhoran commented on July 23, 2024

@MaJuRG I've moved check_moved() from a DeltaCode method to a utility in utils.py but I'm still encountering errors with the structure of my dummy Delta objects. Current error:

. . .
>       if added_deltas[0].new_file.name == removed_deltas[0].old_file.name:
E       AttributeError: 'dict' object has no attribute 'new_file'

Current test structure (similar errors with the alternative constructions of added_deltas and removed_deltas):

    def test_DeltaCode_check_moved_1_sha1_match(self):
        added_sha1 = '6f71666c46446c29d3f45feef5419ae76fb86a5b'
        added_deltas = [
                # OrderedDict([('category', 'added'), ('path', u'b/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
                {
                'old_file': {'sha1': '', 'name': '', 'original_path': '', 'licenses': [], 'path': '', 'type': '', 'size': ''},
                'category': 'removed',
                'new_file': {'sha1': u'6f71666c46446c29d3f45feef5419ae76fb86a5b', 'name': u'a4.py', 'original_path': u'1_file_moved_old/a/a4.py', 'path': u'a/a4.py', 'type': u'file', 'size': 200}
                }
            ]
        removed_sha1 = '6f71666c46446c29d3f45feef5419ae76fb86a5b'
        removed_deltas =  [
                # OrderedDict([('category', 'removed'), ('path', u'a/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
                {
                'old_file': {'sha1': u'6f71666c46446c29d3f45feef5419ae76fb86a5b', 'name': u'a4.py', 'original_path': u'1_file_moved_old/a/a4.py', 'path': u'a/a4.py', 'type': u'file', 'size': 200},
                'category': 'removed',
                'new_file': {'sha1': '', 'name': '', 'original_path': '', 'licenses': [], 'path': '', 'type': '', 'size': ''}
                }
            ]

        result = utils.check_moved(added_sha1, added_deltas, removed_sha1, removed_deltas)

        print('\nresult = {}\n'.format(result))

from deltacode.

steven-esser commented on July 23, 2024

@johnmhoran You are using dicts instead of Delta objects. You have to create actual Delta objects in order to access .new_file or old_file

from deltacode.

steven-esser commented on July 23, 2024

This test for example creates a single delta object:

def test_Delta_create_object_modified(self):
        new = models.File({'path': 'path/modified.txt', 'sha1': 'a'})
        old = models.File({'path': 'path/modified.txt', 'sha1': 'b'})

        delta = deltacode.Delta(new, old, 'modified')

        assert delta.new_file.path == 'path/modified.txt'
        assert delta.new_file.sha1 == 'a'
        assert delta.old_file.path == 'path/modified.txt'
        assert delta.old_file.sha1 == 'b'
        assert delta.category == 'modified'

from deltacode.

johnmhoran commented on July 23, 2024

Thanks @MaJuRG -- I thought I needed to create a "hand-made" Delta object.

from deltacode.

steven-esser commented on July 23, 2024

The basics of this have been implemented with the merge of #42. Closing.

from deltacode.

Handle 'moved' files about deltacode HOT 21 CLOSED

Comments (21)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent