As described in the README, identical inodes can currently be dealt with in one of two ways

Better processing of duplicate inodes about rdfind HOT 4 OPEN

pauldreik commented on May 18, 2024 1

Better processing of duplicate inodes

from rdfind.

Comments (4)

pauldreik commented on May 18, 2024

This would perhaps work, but it complicates the most sensitive part of the deduplication, and I am very scared of introducing bugs there. From the emails people send after seeing the outreach note in the manual page (I just love those messages!), I know rdfind is used on a lot of real systems, and I don't want to upset anyone by losing their files.
If you want to help, please provide unit tests (extending the existing ones) that cover this case, and I will feel more comfortable changing this.
Thanks, Paul

kpatzsc commented on May 18, 2024

This issue affects me too. I can use -removeidentinode false to get around it, but that makes the process take a lot longer, since it hashes the same file (dev/inode) once for each link.
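To illustrate why hashing every link is redundant, here is a minimal Python sketch (not rdfind code) showing that two hard links to one file report the same device/inode pair. The helper name `inode_key` and the file names are made up for the example, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def inode_key(path):
    """Return (device, inode), which identifies the underlying file (hypothetical helper)."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.txt")
    link = os.path.join(tmp, "link.txt")
    with open(original, "w") as f:
        f.write("same content\n")
    os.link(original, link)  # second directory entry for the same inode

    # Both names resolve to the same (device, inode) pair, so reading and
    # hashing each of them separately repeats the same work.
    same_file = inode_key(original) == inode_key(link)
    link_count = os.stat(original).st_nlink  # number of hard links to the inode
```

Here `same_file` is true and `link_count` is 2, which is the situation where hashing per link costs twice the I/O for no new information.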

A couple of ideas I thought of that might help, and that may be simpler to implement than keeping a list of hardlinks for each file:

1. If you use rdfind repeatedly over time to deduplicate files, you'll probably end up with one version of the file with many links, and any new copies will have one link each. In this case, rdfind has to "choose" which copy is the link source and which copy is the link target. An enhancement that chooses the copy with the most links as the one to "keep", and then links the 1-link versions to it, would reduce the occurrence of multiple hardlinked groups of identical files.
2. When scanning files to find duplicates (first byte, last byte, checksums), store a table of the hashes by dev/inode, so that if rdfind encounters another hardlink to a file it has already hashed, it reuses the existing hash instead of re-hashing. This would make -removeidentinode false a lot faster if many hardlinked duplicates already exist, and it wouldn't affect the deduplication code.
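The two ideas above can be sketched roughly as follows. This is an illustrative Python sketch, not rdfind's C++ implementation: `InodeHashCache`, `pick_keeper`, and `inode_key` are hypothetical names, SHA-1 is picked only for the example, and the cache assumes files are not modified while the scan is running.

```python
import hashlib
import os
import tempfile


def inode_key(path):
    """Return (device, inode) for path (hypothetical helper)."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


class InodeHashCache:
    """Cache content digests by (device, inode) so each underlying file is read once."""

    def __init__(self):
        self._digests = {}
        self.files_read = 0  # how many files were actually read and hashed

    def digest(self, path):
        key = inode_key(path)
        if key not in self._digests:
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            self._digests[key] = h.hexdigest()
            self.files_read += 1
        return self._digests[key]


def pick_keeper(paths):
    """Among paths with identical content, prefer the one with the most hard links."""
    return max(paths, key=lambda p: os.stat(p).st_nlink)


with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "old.txt")
    old_link = os.path.join(tmp, "old_link.txt")
    new_copy = os.path.join(tmp, "new_copy.txt")
    for p in (old, new_copy):
        with open(p, "w") as f:
            f.write("payload\n")
    os.link(old, old_link)  # "old" now has two links, "new_copy" has one

    cache = InodeHashCache()
    digests = [cache.digest(p) for p in (old, old_link, new_copy)]
    files_read = cache.files_read  # old_link reuses the digest computed for old
    keeper_name = os.path.basename(pick_keeper([new_copy, old]))
```

In this run three paths are scanned but only two files are read, and the already-linked `old.txt` is selected as the copy to keep. Keying the cache on the device/inode pair leaves the deduplication step itself untouched, which matches the intent of the second idea.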

manfreddz commented on May 18, 2024

Hey @pauldreik. I agree with @Lex-2008, this would be very useful. My C++ skills aren't fantastic, but after quickly glancing over the code, what Lex suggests feels like a small change. Perhaps make "-makehardlinks group" a valid argument, which would keep rdfind backwards-compatible. I'm not sure what you mean by a unit test for this, but I threw together a small bash script that tests what would be useful for me.

test_rdfind_inode_groups.sh.gz

mauromol commented on May 18, 2024

I also believe that if -removeidentinode false were made more efficient by avoiding repeated hashing, -removeidentinode true would perhaps become useless. Then -removeidentinode false could become the default, so that rdfind would behave as expected out of the box when hard-linked files exist in the input.
