Comments (4)

pauldreik commented on May 18, 2024

This might work, but it complicates the most sensitive part of the deduplication, and I am very wary of introducing bugs there. From the emails people send after seeing the reachout in the manual page (I just love those messages!), I know rdfind is used on a lot of real systems, and I don't want anyone to lose their files.
If you want to help, please extend the existing unit tests to cover this case, and I will feel more comfortable changing this.
Thanks, Paul

from rdfind.

kpatzsc commented on May 18, 2024

This issue affects me too. I can use -removeidentinode false to get around it, but that makes the process take a lot longer, since rdfind then hashes the same file (dev/inode) once for each link.

A couple of ideas that might help, and that may be simpler to implement than keeping a list of hardlinks for each file:

  1. If you use rdfind repeatedly over time to deduplicate files, you'll probably end up with one version of the file with many links, while any new copies have one link each. In this case, rdfind has to "choose" which copy is the link source and which is the link target. An enhancement that keeps the copy with the most links and links the 1-link versions to it would reduce the occurrence of multiple hardlinked groups of identical files.

  2. When scanning files to find duplicates (first byte, last byte, checksums), store a table of the hashes by dev/inode, so that if rdfind encounters another hardlink to a file it has already hashed, it reuses the existing hash instead of re-hashing. This would make -removeidentinode false a lot faster when many hardlinked duplicates already exist, and it wouldn't affect the deduplication code.
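
To make the two ideas concrete, here is a minimal sketch in Python (not rdfind's C++, and not rdfind's actual code). The function names `cached_digest` and `pick_keeper` are hypothetical, and SHA-1 is used only as an example checksum. The first function hashes each dev/inode pair once; the second picks the copy that already has the most hard links.

```python
import hashlib
import os


def cached_digest(path, cache):
    """Return a SHA-1 hex digest for path, hashing each (dev, inode) only once.

    `cache` is a plain dict owned by the caller (hypothetical design choice).
    """
    st = os.stat(path)
    key = (st.st_dev, st.st_ino)  # identifies the file regardless of its name
    if key not in cache:
        h = hashlib.sha1()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large files are not loaded whole.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        cache[key] = h.hexdigest()
    return cache[key]


def pick_keeper(paths):
    """Among files known to be identical, prefer the one with the most links."""
    return max(paths, key=lambda p: os.stat(p).st_nlink)
```

With a shared `cache` dict, a second hardlink to an already-hashed file costs one `stat` call instead of a full read, and `pick_keeper` would steer new 1-link copies toward the existing link group.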

manfreddz commented on May 18, 2024

Hey @pauldreik. I agree with @Lex-2008, this would be very useful. My C++ skills aren't fantastic, but after quickly glancing over the code, what Lex suggests feels like a small change. Perhaps make "-makehardlinks group" a valid argument, which would keep rdfind backwards-compatible. I'm not sure what you mean by a unit test for this, but I threw together a small bash script that tests what would be useful for me.

test_rdfind_inode_groups.sh.gz

mauromol commented on May 18, 2024

I also believe that if -removeidentinode false were made efficient enough to avoid hashing the same file multiple times, -removeidentinode true would perhaps become useless. Then -removeidentinode false could become the default, so that rdfind would behave as expected out of the box when hard-linked files exist in the input.
