Comments (4)
This would perhaps work, but it complicates the most sensitive part of the deduplication, and I am very wary of introducing bugs there. From the emails people send after seeing the outreach note in the manual page (I just love those messages!), I know rdfind is used on a lot of real systems, and I don't want to upset anyone by losing their files.
If you want to help, please extend the existing unit tests to cover this case, and I will feel more comfortable changing this.
Thanks, Paul
from rdfind.
This issue affects me too. I can use -removeidentinode false to get around it, but it makes the process take a lot longer since it hashes the same file (dev/inode) for each link.
A couple ideas I thought of that might help, and may be simpler to implement than keeping a list of hardlinks for each file:
- If you use rdfind repeatedly over time to deduplicate files, you'll probably end up with one version of the file with many links, and any new copies will have one link each. In this case, rdfind has to "choose" which copy is the link source and which is the link target. An enhancement that keeps the copy with the most links and links the 1-link versions to it would reduce the occurrence of multiple hardlinked groups of identical files.
- When scanning files to find duplicates (first byte, last byte, checksums), store a table of the hashes by dev/inode, so that if rdfind encounters another hardlink to a file it has already hashed, it reuses the existing hash instead of re-hashing. This would make -removeidentinode false a lot faster when many hardlinked duplicates already exist, and it wouldn't affect the deduplication code.
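To make the two ideas above a bit more concrete, here is a rough C++ sketch. It is not rdfind's code: `FileId`, `Candidate`, `pick_keeper` and `HashCache` are hypothetical names made up for illustration, and the checksum function is passed in as a callback so the sketch does not assume anything about how rdfind actually hashes files. The device, inode and link count would presumably come from something like `stat()` (`st_dev`, `st_ino`, `st_nlink`).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; these are NOT rdfind's own classes.
using FileId = std::pair<std::uint64_t, std::uint64_t>; // (device, inode)

struct Candidate {
  std::string path;
  FileId id;
  std::uint64_t nlink; // hard link count, e.g. st_nlink from stat()
};

// Idea 1: among identical files, keep the one that already has the most
// links. Precondition: group is non-empty. Ties go to the earliest entry.
inline const Candidate& pick_keeper(const std::vector<Candidate>& group) {
  return *std::max_element(
      group.begin(), group.end(),
      [](const Candidate& a, const Candidate& b) { return a.nlink < b.nlink; });
}

// Idea 2: remember checksums per (device, inode) so that another hard link
// to an already-hashed file reuses the stored value instead of re-reading.
class HashCache {
public:
  using Hasher = std::function<std::string(const std::string&)>;

  explicit HashCache(Hasher hasher) : hasher_(std::move(hasher)) {}

  // Returns the checksum for the file identified by id, calling the hasher
  // on path only the first time this id is seen.
  const std::string& checksum(const FileId& id, const std::string& path) {
    auto it = cache_.find(id);
    if (it == cache_.end()) {
      it = cache_.emplace(id, hasher_(path)).first;
    }
    return it->second;
  }

  std::size_t size() const { return cache_.size(); }

private:
  Hasher hasher_;
  std::map<FileId, std::string> cache_;
};
```

With something like this, a run over many existing hardlinks would hash each physical file once, and the keep/link decision stays a separate step from the checksum bookkeeping.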
Hey @pauldreik. I agree with @Lex-2008, this would be very useful. My C++ skills aren't fantastic, but after quickly glancing over the code, what Lex suggests feels like a small change. Perhaps make "-makehardlinks group" a valid argument, which would keep rdfind backwards-compatible. I'm not sure what you mean by a unit test for this, but I threw together a small bash script that tests what would be useful for me.
test_rdfind_inode_groups.sh.gz
I also believe that if -removeidentinode false were made more efficient by avoiding repeated hashing, -removeidentinode true would perhaps become useless. Then -removeidentinode false could become the default, so that rdfind behaves as expected out of the box when hard-linked files exist in the input.