For Optimizers / Developers
So this scenario had no use for #113 at all. But what MIGHT have been done?
Given that "almost 75% of the files scanned have 1 or more duplicates", the "about 60% of 'em are dupes" result could have been found in roughly -- <licks thumb and senses the breeze> -- half the time if the check-head and check-tail content-check phases had been eliminated. After all, for this particular set those phases only cut the candidate count from ~3M down to ~2.2M, i.e. only about a 25% gain, for a noticeable additional HDD access cost.
Maybe a handy command-line option to add: skip phases A+B, only do C?
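For intuition, here's a minimal in-memory sketch of that narrowing pipeline (this is my own reconstruction, not rdfind's actual internals; the 64-byte head/tail window and the phase names A/B/C are assumptions) showing what a skip-head-and-tail option would change:

```python
import hashlib
from collections import defaultdict

HEAD_TAIL_BYTES = 64  # assumed window size; the real tool's value may differ

def narrow(groups, key):
    """Split each candidate group by a cheap key; drop singleton buckets."""
    out = []
    for grp in groups:
        buckets = defaultdict(list)
        for name, data in grp:
            buckets[key(data)].append((name, data))
        out.extend(b for b in buckets.values() if len(b) > 1)
    return out

def find_dupes(files, skip_head_tail=False):
    """files: {name: content-bytes}. Returns groups of confirmed duplicates."""
    groups = [list(files.items())]
    groups = narrow(groups, len)                            # by file size
    if not skip_head_tail:                                  # phases A+B
        groups = narrow(groups, lambda d: d[:HEAD_TAIL_BYTES])   # check head
        groups = narrow(groups, lambda d: d[-HEAD_TAIL_BYTES:])  # check tail
    return narrow(groups, lambda d: hashlib.sha1(d).digest())    # phase C: full hash
```

When most candidates are genuine duplicates, phases A+B discard almost nothing, so every surviving file pays for the extra head/tail reads and then gets fully hashed anyway — that's the ~25%-gain-for-extra-seeks trade-off described above.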
But wait! What's the expected impact of such an optimization when we don't just want a report but wish to do something about it, e.g. replace duplicates with hardlinks? That is arguably one of the most costly dedup operations: it's a create-hardlink plus file-delete action wrapped into one, so there's plenty of filesystem I/O per conversion.
Optimistically, I'd say you'd have to assume a cost of at least one more 'scan time' even for the cheapest action (file-delete?), while the convert-to-hardlink turned out very costly here indeed.
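To make the per-conversion cost concrete, here's a hedged Python sketch of one replace-with-hardlink step (my own illustration, not rdfind's code; the temp-name suffix is made up):

```python
import os

def replace_with_hardlink(original, duplicate):
    """Replace `duplicate` with a hardlink to `original`.

    Sketch only: the link is created under a temporary name and then
    renamed over the duplicate, so a crash in between never leaves the
    duplicate missing. Cost-wise it is still the 'create-hardlink +
    file-delete wrapped into one' described above: two directory-entry
    writes plus the unlink of the old inode, per duplicate file.
    """
    tmp = duplicate + ".tmp-hardlink"  # hypothetical temporary name
    os.link(original, tmp)             # write #1: new name for original's inode
    os.replace(tmp, duplicate)         # write #2: rename over + drop old inode
```

Each call touches the filesystem's metadata at least twice, which is why millions of conversions add up to roughly another full 'scan time' or worse.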
It would, however, be safe to say that reducing the number of rounds through the filesystem and/or file accesses CAN be beneficial. Regrettably, OSes do not hand out APIs to optimize your disk access for head-seek reduction, so the only way up is to get everything onto SSD if you want to speed this up: seek times there are basically nil, though conversion to hardlink still requires some costly SSD page writes/clears.
In the end, is it worth it to deduplicate a large file set like this (millions of files)?
Given that only a small portion of the 41+M files on that HDD got processed in ~4 days, it might be opportune to invest in yet another HDD instead and postpone any mandatory deduplication to the end-point application only -- I was wondering about this as I have a large test corpus that is expected to have some (south of 10%) duplicates... 🤔 It might be "smart" to only detect & report the duplicates via rdfind -- as that is the fastest way to get intel on these, by a large margin -- and then postprocess that results.txt into something that can be parsed and used by the applications involved in processing such data.
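As a starting point for such postprocessing, a small parser sketch; the column layout (duptype id depth size device inode priority name) is what I believe rdfind documents in the header comment of results.txt itself, so verify against your version before relying on it:

```python
def parse_results(path):
    """Group rdfind's results.txt into {first_occurrence: [duplicates]}.

    Assumed layout per data line (check your results.txt header):
        duptype id depth size device inode priority name
    The path is everything after the 7th space, so names containing
    spaces survive.
    """
    groups = {}
    current = None
    with open(path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            duptype, rest = line.rstrip("\n").split(" ", 1)
            name = rest.split(" ", 6)[6]
            if duptype == "DUPTYPE_FIRST_OCCURRENCE":
                current = name
                groups[current] = []
            elif current is not None:
                groups[current].append(name)
    return groups
```

The resulting dict maps each kept file to its duplicates, which is about the shape an end-point application would want.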
Of course, the thriftiness lobe in my brain is on the barricades against this next revolting thought, but it may be unwise to bother deduplicating (very) large file sets unless you're at least moderately sure the deduplication will produce significant space savings, say 50% or better. Otherwise you might be better off begging your employer for more funding for your permanent-storage addiction.
I had intended rdfind to check and dedup only files 100 MByte or larger by specifying -minsize 100M, but that apparently got parsed as -minsize 100 without an error or warning notice (bug? #134), a fact I sneakily hid in the console snapshot shown. 😉
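A stricter argument parser would have caught this. A hypothetical sketch of what a -minsize parser could do instead of silently stopping at the first non-digit (the accepted suffixes here are my assumption, not rdfind's actual grammar):

```python
import re

# Binary unit multipliers for a size argument like "100M".
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_size(text):
    """Parse '100', '100M', '2G', ... and reject anything else outright."""
    m = re.fullmatch(r"(\d+)([KMGT]?)", text.strip().upper())
    if m is None:
        # Instead of silently parsing '100M' as 100, fail loudly.
        raise ValueError(f"bad size argument: {text!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]
```

With this, `-minsize 100M` yields 104857600 and a typo like `100X` aborts with an error rather than quietly scanning everything over 100 bytes.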
Related: #114