
Comments (2)

GerHobbelt commented on June 9, 2024

For Optimizers / Developers

So this scenario had no use for #113 at all. But what MIGHT have been done?

Given that "almost 75% of the files scanned have 1 or more duplicates" plus "about 60% of 'em are dupes", those duplicates could have been found in roughly -- <licks thumb and senses the breeze> -- half the time if the check-head and check-tail content-check phases had been eliminated. After all, for this particular set those phases only cut the candidate list from ~3M down to ~2.2M, so that's merely a ~25% reduction, bought at a noticeable additional HDD access cost.

Maybe a handy command-line option to add: skip phases A+B, only do C?
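To make the trade-off concrete, here is a back-of-envelope comparison in Python. The candidate counts are the ones from this run; the per-file cost constants are pure assumptions (relative units, not measurements), so treat the output as a sketch of the reasoning, not a benchmark.

```python
# Rough cost model for skipping rdfind's first-bytes/last-bytes phases (A+B)
# and going straight to full checksumming (C).
# The two counts come from this run; the cost constants are illustrative only.

CANDIDATES_AFTER_SIZE_GROUPING = 3_000_000   # ~3M files with a size collision
CANDIDATES_AFTER_HEAD_TAIL     = 2_200_000   # ~2.2M left after phases A+B

SEEK_COST     = 1.0   # assumed cost per extra random access (head or tail read)
CHECKSUM_COST = 4.0   # assumed cost of a full read + hash of an average candidate

# Current pipeline: two extra passes (head + tail) over all candidates,
# then a full checksum of whatever survives.
with_head_tail = (2 * SEEK_COST * CANDIDATES_AFTER_SIZE_GROUPING
                  + CHECKSUM_COST * CANDIDATES_AFTER_HEAD_TAIL)

# Proposed pipeline: checksum every size-collision candidate directly.
checksum_only = CHECKSUM_COST * CANDIDATES_AFTER_SIZE_GROUPING

print(f"with head/tail phases: {with_head_tail:,.0f} cost units")
print(f"checksum-only        : {checksum_only:,.0f} cost units")
```

Whether phases A+B pay for themselves hinges entirely on the ratio of a short random read to a full checksum pass and on how many candidates they actually weed out; with only ~25% weeded out here, the extra passes are hard to justify on a spinning disk.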

But wait! What's the expected impact of such an optimization when we do not just want a report but wish to do something about it, e.g. replace duplicates by hardlinks -- arguably one of the most costly dedup operations, since it's a create-hardlink + file-delete action wrapped into one, so plenty of filesystem I/O per conversion?

Optimistically, I'd say you'd have to assume a cost of at least one more 'scan time' even for the cheapest action (file-delete?), while the convert-to-hardlink turned out to be very costly here indeed.
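For reference, a single replace-with-hardlink conversion boils down to roughly the following (a minimal Python sketch, not rdfind's actual implementation; the temporary-name scheme is made up for the example):

```python
import os

def replace_with_hardlink(original: str, duplicate: str) -> None:
    """Replace `duplicate` with a hard link to `original`.

    Illustrative sketch only: link to a temporary name first, then
    atomically rename over the duplicate, so a crash never leaves the
    path missing. Both paths must be on the same filesystem.
    """
    tmp = duplicate + ".dedup-tmp"   # hypothetical temp-name scheme
    os.link(original, tmp)           # metadata write: new directory entry
    os.replace(tmp, duplicate)       # atomic rename; drops the old inode reference
```

Even in the happy path that is at least two metadata-modifying filesystem operations per duplicate, which is why the conversion pass can rival the scan itself in wall-clock time on a spinning disk.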

It would, however, be safe to say that reducing the number of rounds through the filesystem and/or the number of file accesses CAN be beneficial; regrettably, OSes do not hand out APIs for optimizing your disk access in terms of head-seek reduction, so the only way up is to get everything onto an SSD if you want to speed this up (seek times there are basically nil; converting to hardlinks, however, still requires some costly SSD page writes/clears).

In the end, is it worth it to deduplicate a large file set like this (millions of files)?

Given that only a small portion of the 41+M files on that HDD got processed in ~4 days, it might be more opportune to invest in yet another HDD instead and postpone any mandatory deduplication to the end-point application only -- I was wondering about this as I have a large test corpus that is expected to have some (south of 10%) duplicates... 🤔 It might be "smart" to only detect & report the duplicates via rdfind -- as that is by a large margin the fastest way to get intel on them -- and then postprocess that results.txt into something that can be parsed and used by the applications involved in processing such data.
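As a starting point for that postprocessing, something like the sketch below could work. It assumes the whitespace-separated column layout named in the header comment rdfind writes at the top of results.txt (duptype, id, depth, size, device, inode, priority, name) and that duplicate entries carry the id of their first occurrence (negated); both assumptions should be checked against your own results.txt before trusting the output.

```python
#!/usr/bin/env python3
"""Group rdfind's results.txt into {original: [duplicate paths]}.

Sketch only -- verify the column layout against the header comment
in your own results.txt before relying on this.
"""
import sys
from collections import defaultdict

def parse_results(path: str) -> dict[str, list[str]]:
    firsts: dict[int, str] = {}                      # id -> original path
    dupes: dict[int, list[str]] = defaultdict(list)  # id -> duplicate paths
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            # Assumed columns: duptype id depth size device inode priority name
            duptype, ident, _depth, _size, _dev, _inode, _prio, name = \
                line.rstrip("\n").split(maxsplit=7)
            if duptype == "DUPTYPE_FIRST_OCCURRENCE":
                firsts[int(ident)] = name
            else:
                dupes[abs(int(ident))].append(name)
    return {firsts[i]: paths for i, paths in dupes.items() if i in firsts}

if __name__ == "__main__":
    groups = parse_results(sys.argv[1] if len(sys.argv) > 1 else "results.txt")
    n_dupes = sum(len(v) for v in groups.values())
    print(f"{len(groups)} originals, {n_dupes} duplicates that could be removed")
```

From there the downstream application can decide per duplicate group what to do, without a second pass over the data by rdfind itself.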

Of course, the thriftiness-lobe in my brain is on the barricades against this next revolting thought, but it may be unwise to bother deduplicating (very) large file sets unless you're at least moderately sure the deduplication will produce significant space-saving gains, say 50% or better. Otherwise you might be better off begging your employer for more funding for your permanent storage addiction.
I had intended rdfind to check and dedup only files 100 MByte or larger by specifying -minsize 100M, but that apparently got parsed as -minsize 100 without an error or warning notice (bug? #134) -- a fact I sneakily hid in the console snapshot shown. 😉


GerHobbelt commented on June 9, 2024

Related: #114

